
# Pseudo-code for USGS Citation Project

## Environment

Import packages. The packages needed are:

- `from pprintjson import pprintjson as ppjson`: for human-readable results
- `import pandas as pd`: to create a dataframe for the queried information
- `from habanero import Crossref`: habanero is the package used to query the CrossRef database API for information on USGS DOIs for data releases

Both data release and publication DOIs are found within a spreadsheet, which can be uploaded as a CSV or used in Excel. If it is uploaded, additional packages must be imported:

- `import os`: to find the document on your operating system
- `import earthpy as et`

## Workflow

1. Create a path to the document in order to upload it to the notebook. Once the spreadsheet is accessible, either in Excel or as a dataframe within the notebook, results will need to be created by querying the CrossRef API.
2. From the spreadsheet, the publication DOIs must be entered in the CrossRef API in order to query results. Information will then need to be extracted from these results in order to build a dataframe that organizes the gathered information.
3. The information that will need to be extracted for each DOI is:
   - [ ] Task for Taylor: List of columns to be created in table (replacing [original spreadsheet](https://github.com/earthlab/usgs-citations/blob/master/primary_related_pubs_data.csv))
     - DOI of publication (`rel_pub_url`): all publications will have this
     - Year of publication: all publications will have this
     - Title of article: all publications will have this
     - Publisher: all publications will have this
     - `usgs_pub` (was USGS the publisher): for USGS pubs, the DOI will begin like this: `https://doi.org/10.3133` (i.e., we can compute this based solely on the value of `rel_pub_url`)
4. Check whether a reference/citation list exists in the API return.
   Add a `True` or `False` value for a column like `crossref_has_citation_info`.

   - [ ] Taylor writing a function in Python that takes one DOI for a related publication `rel_pub_url` as an argument, and returns `True` or `False` to indicate whether we get citation info from CrossRef, e.g.,

     ```python
     def crossref_has_citations(doi):
         ...  # you fill in this stuff
         return ...  # this will be True or False
     ```

5. If reference/citation info was returned, extract the DOIs from the references. A loop will be needed to do this efficiently rather than extracting them one by one. The `10.5066` data release DOI prefix for USGS will need to be checked within the loop in order to extract only those DOIs, as they are the ones we are interested in.
   - Y/N of whether the data release DOI was referenced: in order to see whether the data release DOI was referenced, there must be a full reference list attached to the related publication DOI.

   When the information is extracted, a list will need to be made to store it. Once the list is made, it can then be converted into a dataframe.

## Questions

- If the spreadsheet is uploaded to the notebook, can we convert that into a list and create a loop to query the information for each publication DOI that we want?
- What additional steps would need to be taken in order to convert that list into a dataframe?
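
One possible answer to both questions, as a minimal sketch rather than the project's settled implementation: read the `rel_pub_url` column into a list, loop over it, collect one dictionary per publication, and pass the list of dictionaries to `pd.DataFrame`. The helper names (`bare_doi`, `summarize_work`, `build_table`), the `year`, `title`, `publisher`, and `data_release_dois` column names, and the assumption that the CSV has a `rel_pub_url` column are illustrative. The field names read from the CrossRef record (`title`, `publisher`, `issued`, `reference`) are assumed to follow the CrossRef works format, and failed lookups are not handled.

```python
USGS_PUB_PREFIX = "10.3133"   # USGS publication DOI prefix (workflow step 3)
USGS_DATA_PREFIX = "10.5066"  # USGS data release DOI prefix (workflow step 5)


def bare_doi(url_or_doi):
    """Drop a leading doi.org URL so that DOI prefixes can be compared."""
    doi = url_or_doi.strip()
    for lead in ("https://doi.org/", "http://doi.org/",
                 "https://dx.doi.org/", "http://dx.doi.org/"):
        if doi.lower().startswith(lead):
            return doi[len(lead):]
    return doi


def summarize_work(rel_pub_url, message):
    """Build one table row from the `message` dict of a CrossRef works record."""
    references = message.get("reference", [])  # absent when no reference list was deposited
    data_dois = [ref["DOI"] for ref in references
                 if ref.get("DOI", "").startswith(USGS_DATA_PREFIX + "/")]
    date_parts = message.get("issued", {}).get("date-parts", [[None]])
    titles = message.get("title") or [None]
    return {
        "rel_pub_url": rel_pub_url,
        "year": date_parts[0][0] if date_parts and date_parts[0] else None,
        "title": titles[0],
        "publisher": message.get("publisher"),
        "usgs_pub": bare_doi(rel_pub_url).startswith(USGS_PUB_PREFIX + "/"),
        "crossref_has_citation_info": len(references) > 0,
        "data_release_dois": data_dois,
    }


def build_table(csv_path, doi_column="rel_pub_url"):
    """Query CrossRef once per publication DOI and return a dataframe."""
    # Imported here so the helpers above can be tried without these packages.
    import pandas as pd
    from habanero import Crossref

    cr = Crossref()
    # Spreadsheet column -> plain Python list of unique, non-missing DOIs.
    dois = pd.read_csv(csv_path)[doi_column].dropna().unique().tolist()
    rows = []
    for url in dois:
        message = cr.works(ids=bare_doi(url))["message"]
        rows.append(summarize_work(url, message))
    # A list of dicts converts directly: keys become column names.
    return pd.DataFrame(rows)
```

Calling `build_table("primary_related_pubs_data.csv")` would then give one row per related publication. Keeping `summarize_work` separate from the network call means the row-building logic can be checked against a saved CrossRef response before running the full loop.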