# Pubmed Scraping
## Project Goal
Scrape Europe PubMed (Europe PMC) for author and article data, starting from this search request: https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=SRC:MED&cursorMark=*&format=json&pageSize=25&sort=FIRST_IDATE_D+desc&resultType=core
**Customizing URLs:**
- Article Source: *[query=SRC:MED]* (refer to the source table below)
- Page Size: *[&pageSize=25]* (max 1000 allowed)
- Next Page: *[&cursorMark=]* (take the "cursorMark" value returned by each response and include it in the next request; repeat until the last page)
- Query Response Format: *[&format=json]* (json | xml)
- Response Format for Authors:
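
The cursor-based paging described above can be sketched as follows. This is a minimal sketch, not a tested scraper: it assumes the JSON response exposes the next cursor as `nextCursorMark` and the records under `resultList.result`, which should be verified against a real response. The `fetch` parameter and the function names are hypothetical; `fetch` is injectable so the loop can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_url(cursor_mark="*", source="MED", page_size=25):
    """Build one search URL; the parameters mirror the example URL above."""
    params = {
        "query": f"SRC:{source}",
        "cursorMark": cursor_mark,
        "format": "json",
        "pageSize": page_size,
        "sort": "FIRST_IDATE_D desc",
        "resultType": "core",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_json(url):
    """Download one page and decode it as JSON."""
    with urlopen(url) as resp:
        return json.load(resp)


def iter_articles(fetch=fetch_json, source="MED", page_size=25, max_pages=None):
    """Yield article records page by page until no further page is returned."""
    cursor = "*"  # "*" requests the first page
    pages = 0
    while max_pages is None or pages < max_pages:
        data = fetch(build_url(cursor, source, page_size))
        # ASSUMPTION: records live under resultList.result
        results = data.get("resultList", {}).get("result", [])
        if not results:
            break
        yield from results
        pages += 1
        # ASSUMPTION: the next cursor is returned as "nextCursorMark"
        next_cursor = data.get("nextCursorMark")
        if not next_cursor or next_cursor == cursor:
            break
        cursor = next_cursor
```

Passing `max_pages=1` is a cheap way to try the loop against the live service before committing to a full crawl.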

## Output
The output must be suitable for building a relational database similar to "SciLeads".
1. Author module
- **Fields**: AuthorID, FirstName, middleInitial(s), LastName, Email, Phone, externalID(ORCID), OrganizationID, DeptID
2. Organization module
An organization is an entity that we can easily delimit and refer to. The following entities are typical organizations: University, Company, Government agency, Non-profit organization, Research institute (not affiliated with a university). Note that a university is an organization, but a department or a lab within that university is not an "organization" per se. In PubMed, "Affiliation" corresponds to the concept of organization here.
- **Fields**: OrganizationID, organizationName, address, city, state (Province), zip, country
3. Department module
- **Fields**: DeptID, DeptName
4. Article module
- **Fields**: ArticleID, authorList, articleTitle, abstractText, pubYear, keywords, source, sourceurl, doi, pmid, pmcid, journalID, pageNumbers
5. Journal module
- **Fields**: JournalID, JournalTitle, Publisher, pISSN, eISSN
- List of journals can be obtained from https://europepmc.org/journalList?format=csv
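
As a rough illustration of how one API record could be flattened into the Article module fields listed above, here is a minimal sketch. The response-side key names (`title`, `journalInfo`, `issn`, `essn`, `fullName`, `source`) are assumptions about the `resultType=core` payload and need checking against a real record; only `pageInfo` and `authorList` are named in these notes. `article_row` is a hypothetical helper.

```python
def article_row(record):
    """Map one API record (a dict) to the Article module fields.

    ASSUMPTION: all response keys read here are guesses at the core
    payload layout; missing keys simply yield None.
    """
    journal = record.get("journalInfo", {}).get("journal", {})
    authors = record.get("authorList", {}).get("author", [])
    return {
        "articleTitle": record.get("title"),
        "abstractText": record.get("abstractText"),
        "pubYear": record.get("pubYear"),
        "source": record.get("source"),
        "doi": record.get("doi"),
        "pmid": record.get("pmid"),
        "pmcid": record.get("pmcid"),
        "pageNumbers": record.get("pageInfo"),
        # Author names only; the Author module would hold the full details.
        "authorList": [a.get("fullName") for a in authors],
        # Kept so the journal can be resolved to a JournalID later.
        "pISSN": journal.get("issn"),
        "eISSN": journal.get("essn"),
    }
```

Using `.get` throughout means an incomplete record produces `None` values rather than an exception, which seems preferable for a bulk scrape.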
## Discussion points/Questions
1. What is required to associate the different modules? ID-ID matching table?
- Author module will have the articleID to match
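
Since one author can write many articles and one article can have many authors, an ID-ID matching table is one way to connect the two modules. The sketch below uses Python's built-in `sqlite3` purely for illustration; the table name `article_author`, the `authorPosition` column, and both helper functions are hypothetical, and the column lists are trimmed to the minimum.

```python
import sqlite3


def create_link_schema(conn):
    """Create minimal author/article tables plus the ID-ID matching table."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS author (
            AuthorID INTEGER PRIMARY KEY,
            LastName TEXT
        );
        CREATE TABLE IF NOT EXISTS article (
            ArticleID INTEGER PRIMARY KEY,
            articleTitle TEXT
        );
        -- One row per (article, author) pair.
        CREATE TABLE IF NOT EXISTS article_author (
            ArticleID INTEGER NOT NULL REFERENCES article(ArticleID),
            AuthorID INTEGER NOT NULL REFERENCES author(AuthorID),
            authorPosition INTEGER,  -- hypothetical: order in the byline
            PRIMARY KEY (ArticleID, AuthorID)
        );
    """)


def articles_for_author(conn, author_id):
    """Return (ArticleID, articleTitle) rows for one author."""
    return conn.execute(
        "SELECT a.ArticleID, a.articleTitle "
        "FROM article a "
        "JOIN article_author aa ON aa.ArticleID = a.ArticleID "
        "WHERE aa.AuthorID = ? "
        "ORDER BY a.ArticleID",
        (author_id,),
    ).fetchall()
```

With this layout neither the Author nor the Article table needs to store a list of IDs; the pairing lives only in the matching table.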

2. Should "Journal" be a separate module?
Yes, we need to have this journals list in a separate module
PubMed does not assign a unique ID to each journal; it identifies journals by "pISSN" and "eISSN" instead.
Journals are here https://europepmc.org/journalList
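
Because journals carry no unique ID of their own, the JournalID has to be assigned locally and looked up by either ISSN. A minimal sketch of that idea, assuming a journal can be matched on its pISSN or its eISSN; the `JournalRegistry` class is hypothetical, and the ISSN strings in the usage lines are placeholders rather than real journals.

```python
class JournalRegistry:
    """Assign surrogate JournalIDs and index them by pISSN and eISSN."""

    def __init__(self):
        self._by_issn = {}  # normalized ISSN -> JournalID
        self._next_id = 1

    def get_or_create(self, p_issn=None, e_issn=None):
        """Return the JournalID for the given ISSNs, creating one if new."""
        keys = [k.strip().upper() for k in (p_issn, e_issn) if k and k.strip()]
        if not keys:
            raise ValueError("at least one ISSN is required")
        # Reuse an existing ID if either ISSN is already known.
        journal_id = next(
            (self._by_issn[k] for k in keys if k in self._by_issn), None
        )
        if journal_id is None:
            journal_id = self._next_id
            self._next_id += 1
        for k in keys:
            self._by_issn.setdefault(k, journal_id)
        return journal_id


registry = JournalRegistry()
first = registry.get_or_create("1234-5678", "8765-4321")  # placeholder ISSNs
same = registry.get_or_create(e_issn="8765-4321")         # matched by eISSN
```

Here `same` resolves to the ID already given to `first`, so an article record that only carries the eISSN still links to the right journal row.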

3. How many total articles are there in Europe PubMed?
These are the sources and their record counts:
| Source | Count |
| -------- | -------- |
| AGR | 762,487 |
| CBA | 142,377 |
| CTX | 3,699 |
| ETH | 53,091 |
| HIR | 2,868 |
| MED | 30,955,010 |
| PAT | 4,229,297 |
| PMC | 616,064 |
| PPR | 136,139 |
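
Summing the table gives a single total for the question above. This is simple arithmetic on the listed counts and assumes the sources do not overlap, which has not been verified here.

```python
# Record counts copied from the table above.
counts = {
    "AGR": 762_487,
    "CBA": 142_377,
    "CTX": 3_699,
    "ETH": 53_091,
    "HIR": 2_868,
    "MED": 30_955_010,
    "PAT": 4_229_297,
    "PMC": 616_064,
    "PPR": 136_139,
}

total = sum(counts.values())
print(f"{total:,}")  # 36,901,032
```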


4. Should Author and Institution be two separate modules?
Yes, if it is possible to split the institution name from the department.
5. In a database, what do you do when an entity has (for lack of a better word) multiple properties associated with it? For example, a university has many departments, and a journal has many volumes/issues.
For departments: we split them into a separate table (in MySQL, an entity type becomes a table and each instance becomes a row, or tuple).
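Example Department table:

| DeptID | DeptName |
| -------- | -------- |
| 1 | Medical |
| 2 | Engineering |
| 3 | Computer Studies |

Example Organization table:

| OrganizationID | organizationName |
| -------- | -------- |
| 1 | UCSD |
| 2 | Pfizer |
| 3 | UCLA |

Example Author table:

| AuthorID | FirstName | OrganizationID | DeptID |
| -------- | -------- | -------- | -------- |
| 1 | Karthick | 1 | 1 |
| 2 | John | 1 | 3 |
| 3 | David | 3 | 3 |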
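
The example rows above can be loaded into an in-memory SQLite database to show how the IDs resolve back to names. This is only a sketch: SQLite stands in for MySQL, the lowercase table names are hypothetical, and the columns are limited to those in the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (DeptID INTEGER PRIMARY KEY, DeptName TEXT);
    CREATE TABLE organization (
        OrganizationID INTEGER PRIMARY KEY,
        organizationName TEXT
    );
    CREATE TABLE author (
        AuthorID INTEGER PRIMARY KEY,
        FirstName TEXT,
        OrganizationID INTEGER REFERENCES organization(OrganizationID),
        DeptID INTEGER REFERENCES department(DeptID)
    );
""")

# Rows taken from the example tables in this section.
conn.executemany(
    "INSERT INTO department VALUES (?, ?)",
    [(1, "Medical"), (2, "Engineering"), (3, "Computer Studies")],
)
conn.executemany(
    "INSERT INTO organization VALUES (?, ?)",
    [(1, "UCSD"), (2, "Pfizer"), (3, "UCLA")],
)
conn.executemany(
    "INSERT INTO author VALUES (?, ?, ?, ?)",
    [(1, "Karthick", 1, 1), (2, "John", 1, 3), (3, "David", 3, 3)],
)

# Resolve each author's organization and department names via joins.
rows = conn.execute("""
    SELECT a.FirstName, o.organizationName, d.DeptName
    FROM author a
    JOIN organization o ON o.OrganizationID = a.OrganizationID
    JOIN department d ON d.DeptID = a.DeptID
    ORDER BY a.AuthorID
""").fetchall()

for row in rows:
    print(row)
```

Note that in this layout "Computer Studies" is a single department row shared by UCSD and UCLA, which is what the example data implies; whether departments should instead be scoped per organization is left open here.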
For journal volumes/issues: include them in the author module along with the journalID --- I have decided not to keep volume/issue. That information is pretty much useless in this digital/internet era. --ok
6. What is the "pageNumbers" in Article Module? (Yes, I can see pageInfo "249-262")
Just a really minor nuance. In a journal issue, an article starts at a page and ends at a page. For example, page 10 to page 20. If not available in article metadata, then don't worry about it. Ok then.
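
If the start and end pages are ever wanted as separate values, a "pageInfo" string such as "249-262" can be split as below. This is a minimal sketch with a hypothetical helper name; it assumes a single hyphen separates the two pages and does not expand abbreviated ranges (for example "249-62").

```python
def parse_page_info(page_info):
    """Split a pageInfo string into (start_page, end_page) strings.

    Returns (None, None) when the value is missing, and (page, None)
    when there is no range, e.g. a single page or an article number.
    """
    if not page_info or not page_info.strip():
        return (None, None)
    start, sep, end = page_info.strip().partition("-")
    start = start.strip() or None
    end = end.strip() or None
    return (start, end if sep else None)
```

Pages are kept as strings because some values are not purely numeric.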
7. Can you confirm if an article might have only one journal associated with it [or] there can be many? (I can see only 1 journal in all articles I have checked so far) --- Yes. An article is associated with one journal and only one. --ok
8. Is the Europe PubMed data downloadable like the US PubMed data?
Not all articles are available for download. I have downloaded 3 files and each one has exactly 10,000 articles. There are 276 files, so only about 2.76 million articles are available for download. (I have not checked whether all US PubMed articles are available to download --- just checked a single file, it has 30,000 articles, and there are 1014 files, about 30.4 million, so we can download (if needed) PubMed articles from US PubMed.)
By API, we can get all 30 million. --- OK
9. Is the current data schema optimized for efficient and accurate queries?
10. What
## Gavin Notes
Ok.
## Karthick Notes
US Pubmed Journals List: https://www.ncbi.nlm.nih.gov/pmc/journals/?format=csv
Each article has **one or many?** journals associated with it
- if each article will have only one journal associated with it, then we need to include the journalID in article fields list
- if they can have multiple journals, then we need to use a separate table (articleID and JournalID as fields) to maintain the connections