Below is a “shopping list” of data sources + API hooks you can stitch together so your tool can sweep up education / teaching-and-learning scholarship almost as easily as it now hoovers arXiv for AI papers. I’ve grouped them by how open they are and what each brings to the table, then sketched one canonical pipeline that glues them together.
---
## 1. Open, discipline-agnostic mega-indexes
These behave like arXiv’s global firehose, but cover *all* fields and expose robust APIs.
| Service | What you get | How to query/filter | Typical rate limits |
| --- | --- | --- | --- |
| **OpenAlex** | 250 M works, unified *concept* taxonomy, author/institution graphs | `/works?search="active learning"&filter=concept.id:C2781132997`, or combine several filters with commas: `filter=concept.id:...,is_paratext:false,from_publication_date:2015-01-01` | 100 k req/day, no key needed ([OpenAlex Documentation][1]) |
| **CORE** | 30 M OA full-text PDFs harvested via OAI-PMH | REST: `/search/education AND ("higher education" OR pedagogy)` or harvest sets over OAI-PMH | 30 req/min unless you request a key ([CORE][2]) |
| **Semantic Scholar** | 200 M papers (abstract-level), good citation graph | `/graph/v1/paper/search?query=constructivism&fields=title,year,authors,openAccessPdf` | 100 req/5 min (free tier) ([Semantic Scholar][3]) |
| **Crossref + Unpaywall** | Raw DOI metadata for *everything* + OA PDF lookup | (1) hit Crossref `/works?query.bibliographic=…&filter=type:journal-article,from-pub-date:2020-01-01&rows=1000`, then (2) pass each DOI to Unpaywall `/v2/{doi}?email=you@example.com` | polite 1 req/s; ≥100 k/day OK ([www.crossref.org][4], [Unpaywall][5]) |
> **Tip**: OpenAlex already stores Unpaywall status, which saves you a second hop.
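
To make that tip concrete, here is a minimal sketch that reads the open-access fields straight off an OpenAlex work record instead of calling Unpaywall. The field names (`open_access.is_oa`, `open_access.oa_status`, `open_access.oa_url`, `best_oa_location.pdf_url`) reflect my understanding of the work schema and should be checked against a live record; `oa_summary` and the sample record are made up for illustration.

```python
from typing import Any, Optional


def oa_summary(work: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Pull OA status and the best PDF link out of one OpenAlex work record."""
    open_access = work.get("open_access") or {}
    best = work.get("best_oa_location") or {}  # can be null for closed works
    return {
        "doi": work.get("doi"),
        "is_oa": bool(open_access.get("is_oa", False)),
        "oa_status": open_access.get("oa_status"),
        # Prefer a direct PDF link; fall back to the general OA URL.
        "pdf_url": best.get("pdf_url") or open_access.get("oa_url"),
    }


# Hand-made sample record (not real API output).
sample = {
    "doi": "https://doi.org/10.1234/example",
    "open_access": {"is_oa": True, "oa_status": "gold",
                    "oa_url": "https://example.org/landing"},
    "best_oa_location": {"pdf_url": "https://example.org/paper.pdf"},
}
summary = oa_summary(sample)
```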
---
## 2. Education-specific open repositories & preprint servers
| Source | Why it matters | API / harvesting path |
| ------------------------------------------ | ----------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **ERIC** (Institute of Education Sciences) | The canonical U.S. education index (peer-reviewed journals *plus* grey literature, reports, dissertations). | REST: `https://api.ies.ed.gov/eric/?search="project-based learning"&format=json&rows=200` (no API key required). PDF links often included. ([ERIC][6], [ERIC][7]) |
| **EdArXiv** | Education-research preprints (OSF). Great for cutting-edge, OA content. | OSF API v2 (JSON:API): `https://api.osf.io/v2/preprints/?filter[provider]=edarxiv`, following `links.next` to paginate. ([OSF][8], [Center for Open Science][9]) |
| **PsyArXiv / SocArXiv** | Education psych & sociology angles; both OSF-hosted, same API pattern. | Swap the provider slug in the filter (`psyarxiv`, `socarxiv`). |
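
For the OSF-hosted preprint servers, a page-parsing helper might look like the sketch below. It assumes the OSF JSON API (`https://api.osf.io/v2/preprints/`) with a `filter[provider]` parameter, a `page[size]` parameter, and the usual JSON:API layout (`data[].attributes`, `links.next`); the attribute names `title`, `description`, and `date_published` are my assumptions, and both function names are hypothetical.

```python
from typing import Any, Optional

OSF_PREPRINTS = "https://api.osf.io/v2/preprints/"  # assumed endpoint


def first_page_params(provider: str, page_size: int = 100) -> dict[str, Any]:
    """Query parameters for the first page of one preprint server."""
    return {"filter[provider]": provider, "page[size]": page_size}


def parse_page(payload: dict[str, Any]) -> tuple[list[dict[str, Any]], Optional[str]]:
    """Return (records, next_url) from one JSON:API response page."""
    records = []
    for item in payload.get("data", []):
        attrs = item.get("attributes") or {}
        records.append({
            "id": item.get("id"),
            "title": attrs.get("title"),
            "abstract": attrs.get("description"),
            "date_published": attrs.get("date_published"),
        })
    next_url = (payload.get("links") or {}).get("next")  # None on the last page
    return records, next_url


# Hand-made page for illustration; a real loop would GET OSF_PREPRINTS with
# first_page_params("edarxiv") and then keep following next_url until None.
page = {
    "data": [{"id": "abc12", "attributes": {
        "title": "T", "description": "D",
        "date_published": "2024-01-01T00:00:00"}}],
    "links": {"next": None},
}
records, next_url = parse_page(page)
```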
---
## 3. Directory or journal-level metadata feeds
*Use these to generate a whitelist of journals to watch, then poll Crossref or RSS each day.*
* **DOAJ** (Directory of Open Access Journals) lets you list every OA journal tagged *Education*, along with ISSNs. Their JSON API: `https://doaj.org/api/search/journals/education?page=1&pageSize=100` returns journal records whose ISSNs you can feed to Crossref. ([Directory of Open Access Journals][10])
* **JournalTOCs** (free RSS) for tables-of-contents alerts when new issues drop.
* **ISSN → RSS** hacks: Many publishers offer per-journal Atom feeds once you know the ISSN.
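
Once you have a list of ISSNs, the daily Crossref poll can be as simple as one URL per journal. The sketch below validates each ISSN with the standard check-digit rule and builds a request against Crossref's `/journals/{issn}/works` route with a `from-pub-date` filter; the function names are hypothetical, and fetching plus paging through results is left out.

```python
from urllib.parse import urlencode


def is_valid_issn(issn: str) -> bool:
    """Check the ISSN format and its mod-11 check digit (weights 8 down to 2)."""
    s = issn.replace("-", "").upper()
    if len(s) != 8 or not s[:7].isdigit() or not (s[7].isdigit() or s[7] == "X"):
        return False
    total = sum(int(d) * w for d, w in zip(s[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return s[7] == ("X" if check == 10 else str(check))


def crossref_journal_url(issn: str, since: str, rows: int = 100) -> str:
    """URL listing works published in one journal since a date (YYYY-MM-DD)."""
    query = urlencode({"filter": f"from-pub-date:{since}", "rows": rows})
    return f"https://api.crossref.org/journals/{issn}/works?{query}"


# Usage: skip malformed ISSNs before building the polling URLs.
watchlist = ["0378-5955", "0378-5954"]  # second entry has a wrong check digit
urls = [crossref_journal_url(i, "2024-01-01") for i in watchlist if is_valid_issn(i)]
```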
---
## 4. Licensed “big guns” (optional if Harvard provides tokens)
| Platform | Access model | Why bother |
| ------------------------------ | ------------------------------------- | ------------------------------------------------------------- |
| **Scopus API** (Elsevier) | Subscription; key via Harvard Library | Very complete citation data, author disambiguation |
| **Web of Science InCites API** | Subscription | Impact factors, citation context |
| **EBSCO EDS / ProQuest** | Subscription | Includes full-text education journals often missing elsewhere |
---
## 5. A reference pipeline you can clone
```mermaid
flowchart TD
search(OpenAlex / ERIC query queue) --> dedupe(Deduplicate by DOI)
dedupe --> oa(Unpaywall lookup)
oa -->|is_oa| fetch[Fetch PDF → S3]
oa -->|not_oa| proxy(Proxy via Harvard EZproxy)
fetch --> embed(Chunk & embed → vector DB)
proxy --> embed
```
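
The dedupe and OA-routing nodes of the flowchart could be sketched roughly as follows. This assumes each harvested record has already been flattened to a dict with optional `doi`, `is_oa`, and `pdf_url` keys; those key names and all three function names are illustrative, not part of any provider's API.

```python
from typing import Any, Iterable, Optional

_DOI_PREFIXES = ("https://doi.org/", "http://doi.org/",
                 "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip resolver prefixes so variants compare equal."""
    if not doi:
        return None
    d = doi.strip().lower()
    for prefix in _DOI_PREFIXES:
        if d.startswith(prefix):
            d = d[len(prefix):]
            break
    return d or None


def dedupe_by_doi(records: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first record per DOI; records without a DOI are all kept."""
    seen: set[str] = set()
    kept = []
    for rec in records:
        key = normalize_doi(rec.get("doi"))
        if key is None:
            kept.append(rec)  # nothing to match on at this stage
        elif key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def route(record: dict[str, Any]) -> str:
    """'fetch' when an OA PDF link is known, otherwise 'proxy'."""
    return "fetch" if record.get("is_oa") and record.get("pdf_url") else "proxy"


harvested = [
    {"doi": "10.1/ABC", "source": "openalex"},
    {"doi": "https://doi.org/10.1/abc", "source": "eric"},  # same work
    {"doi": None, "title": "Report without DOI"},
]
unique = dedupe_by_doi(harvested)
```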
### Steps
1. **Seed query builder**
   * Build a UI that lets users pick: keyword + year range + “only peer-reviewed”.
   * Translate to provider-specific parameters (OpenAlex `search` + `filter`, ERIC `search=` etc.).
2. **Metadata harvest & de-duplication**
   * Match on DOI, then on title + authors, and keep one canonical row per work.
   * Save provider IDs for refresh.
3. **OA resolution & PDF fetch**
   * If OpenAlex **or** Unpaywall reports `is_oa=true`, download the PDF.
   * Else construct a library OpenURL so Harvard users can still click through.
4. **Enrichment**
   * Use Crossref funding body, field-of-study codes, and ERIC Thesaurus terms as tags.
   * Store abstracts + citations for later RAG.
5. **Vectorisation & search**
   * Embed with OpenAI embeddings (or an Instructor model for lower cost) → pgvector or Chroma.
   * Surface in your UI with hybrid keyword + vector search.
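
Step 1's translation layer might look like the sketch below: one provider-neutral query object and one function per provider. Several things here are assumptions rather than confirmed API behaviour: OpenAlex has no peer-review flag that I know of, so `primary_location.source.type:journal` stands in as a rough proxy, and the ERIC field names `publicationdateyear` and `peerreviewed` (with value `T`) should be checked against the ERIC API docs. `SeedQuery` and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SeedQuery:
    keyword: str
    from_year: int
    to_year: int
    peer_reviewed_only: bool = False


def to_openalex_params(q: SeedQuery) -> dict[str, Any]:
    """OpenAlex: free-text `search` plus comma-joined (AND) filters."""
    filters = [
        f"from_publication_date:{q.from_year}-01-01",
        f"to_publication_date:{q.to_year}-12-31",
    ]
    if q.peer_reviewed_only:
        # Assumption: journal-hosted works as a proxy for peer review.
        filters.append("primary_location.source.type:journal")
    return {"search": q.keyword, "per_page": 200, "filter": ",".join(filters)}


def to_eric_params(q: SeedQuery) -> dict[str, Any]:
    """ERIC: everything goes into one boolean `search` string."""
    years = " OR ".join(
        f"publicationdateyear:{y}" for y in range(q.from_year, q.to_year + 1)
    )
    terms = [f'"{q.keyword}"', f"({years})"]
    if q.peer_reviewed_only:
        terms.append("peerreviewed:T")  # assumed field name and value
    return {"search": " AND ".join(terms), "format": "json", "rows": 200}


query = SeedQuery("active learning", 2020, 2021, peer_reviewed_only=True)
openalex_params = to_openalex_params(query)
eric_params = to_eric_params(query)
```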
---
## 6. Quick code snippets
**OpenAlex concept filter (Python):**
```python
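import pandas as pd
import requests

C_EDU = "C2775375081"  # concept_id for "Higher Education"
flt = f"concept.id:{C_EDU},from_publication_date:2022-01-01"  # commas mean AND
resp = requests.get("https://api.openalex.org/works", params={"per_page": 200, "filter": flt}, timeout=30)
resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
df = pd.json_normalize(resp.json()["results"])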
```
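
One wrinkle if you want abstracts out of those results: as far as I know, OpenAlex does not return plain-text abstracts but an `abstract_inverted_index` mapping each word to the positions where it occurs. A minimal sketch for turning that back into text, with `rebuild_abstract` as a hypothetical helper name:

```python
from typing import Optional


def rebuild_abstract(inverted_index: Optional[dict[str, list[int]]]) -> Optional[str]:
    """Reassemble an abstract from a {word: [positions]} inverted index."""
    if not inverted_index:
        return None  # many records carry no abstract at all
    by_position: dict[int, str] = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    return " ".join(by_position[pos] for pos in sorted(by_position))


# Hand-made index for illustration.
example_index = {"Active": [0], "learning": [1, 4], "improves": [2], "student": [3]}
text = rebuild_abstract(example_index)
# text == "Active learning improves student learning"
```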
**ERIC search (Python):**
```python
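import requests

BASE = "https://api.ies.ed.gov/eric/"  # public endpoint; no API key parameter
params = {
    "search": '"active learning" AND "STEM"',
    "rows": 100,  # records per request
    "format": "json",
}
resp = requests.get(BASE, params=params, timeout=30)
resp.raise_for_status()
results = resp.json()["response"]["docs"]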
```
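
A single request only returns one page, so a harvest needs to walk the result set. The sketch below generates the parameter dicts for each page, assuming ERIC accepts a `start` offset alongside `rows` and reports the total hit count (for example as `response.numFound`); both are assumptions to confirm in the API docs, and `eric_page_params` is a hypothetical name.

```python
from typing import Any, Iterator


def eric_page_params(search: str, total: int, rows: int = 200) -> Iterator[dict[str, Any]]:
    """Yield one params dict per page needed to cover `total` records."""
    for start in range(0, total, rows):
        yield {
            "search": search,
            "format": "json",
            "start": start,                    # zero-based offset (assumed)
            "rows": min(rows, total - start),  # last page may be shorter
        }


# Usage: 450 hits at 200 per page means three requests.
pages = list(eric_page_params('"active learning" AND "STEM"', total=450))
```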
**Unpaywall resolution:**
```python
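import requests

doi = "10.1080/00091383.2024.1234567"  # placeholder DOI
r = requests.get(f"https://api.unpaywall.org/v2/{doi}", params={"email": "you@example.com"}, timeout=30).json()
loc = r.get("best_oa_location") or {}  # null when no OA copy is known
pdf_url = loc.get("url_for_pdf")  # may still be None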
```
---
## 7. Wrapping it into your AI tool
* **Unified API layer**: abstract each provider into the same response schema (`title`, `year`, `doi`, `abstract`, `pdf_url`, `source`).
* **Caching**: OpenAlex and ERIC both timestamp records; store `updated_date` so daily cron jobs pull only deltas.
* **Throttling/back-off**: respect per-minute limits (especially Semantic Scholar).
* **Quality filters**: use ERIC's peer-review flag (the `peerreviewed` field) if you want to exclude lesson plans and similar non-reviewed material.
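
The unified API layer could start as small as one dataclass plus one adapter per provider, as in the sketch below. The schema fields come from the bullet above; the provider-side keys (`title`, `publication_year`, `doi`, `best_oa_location.pdf_url` for OpenAlex and `title`, `publicationdateyear`, `description` for ERIC) are my assumptions about each response format, and the adapter names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Paper:
    title: Optional[str]
    year: Optional[int]
    doi: Optional[str]
    abstract: Optional[str]
    pdf_url: Optional[str]
    source: str


def from_openalex(work: dict[str, Any]) -> Paper:
    """Map one OpenAlex work record onto the shared schema."""
    best = work.get("best_oa_location") or {}
    return Paper(
        title=work.get("title"),
        year=work.get("publication_year"),
        doi=work.get("doi"),
        abstract=None,  # fill in separately if you rebuild abstracts
        pdf_url=best.get("pdf_url"),
        source="openalex",
    )


def from_eric(doc: dict[str, Any]) -> Paper:
    """Map one ERIC record onto the shared schema."""
    year = doc.get("publicationdateyear")
    return Paper(
        title=doc.get("title"),
        year=int(year) if year is not None else None,
        doi=None,  # not assumed to be present in ERIC records
        abstract=doc.get("description"),
        pdf_url=None,
        source="eric",
    )


# Hand-made records for illustration.
papers = [
    from_openalex({"title": "A", "publication_year": 2023,
                   "doi": "https://doi.org/10.1/a", "best_oa_location": None}),
    from_eric({"title": "B", "publicationdateyear": "2021", "description": "Abstract B"}),
]
```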
---
### TL;DR
There’s no single “edu-arXiv”, but by chaining **OpenAlex** (broad) + **ERIC** (education-specific) + **EdArXiv & friends** (preprints) and running every DOI through **Unpaywall** for free PDFs, you cover most peer-reviewed and grey literature, all with open, automation-friendly APIs. Everything else (Scopus, Web of Science) is icing you can layer on later using your Harvard credentials.

[1]: https://docs.openalex.org/quickstart-tutorial "Quickstart tutorial | OpenAlex technical documentation"
[2]: https://core.ac.uk/ "CORE – Aggregating the world's open access research papers"
[3]: https://www.semanticscholar.org/product/api "Semantic Scholar Academic Graph API"
[4]: https://www.crossref.org/documentation/retrieve-metadata/rest-api/ "REST API - Crossref"
[5]: https://unpaywall.org/products/api "REST API - Unpaywall"
[6]: https://eric.ed.gov/?api "ERIC API Docs"
[7]: https://eric.ed.gov/pdf/Using_ERIC_API_for_Research_Topics.pdf "[PDF] Using the ERIC API for Research Topics"
[8]: https://osf.io/preprints/edarxiv/ "EdArXiv Preprints - OSF"
[9]: https://www.cos.io/about/news/center-open-science-and-edarxiv-launch-branded-preprint-service-educational-research "Center for Open Science and EdArXiv Launch Branded Preprint ..."
[10]: https://doaj.org/ "DOAJ: Directory of Open Access Journals"