# Summary of Scraping System Meeting

> Arcadia Sprint Meeting
> Feb/2020
> Marcus GUIDOTI & Johannes HAMANN

We've split the task into four steps:

1. Scraping
2. Downloading the files
3. Sending them to the batch-processing pipeline
4. Notifying people

Step 1 will be handled by publisher- or journal-specific Python scraping scripts, which Marcus will code. Each script will always produce two standard files: one \*.txt with the links to be downloaded and the names the files must be renamed to, and one \*.txt with the log of the scraping run. One idea was to create a renaming.py routine that will be applicable to all papers. The file names need to follow a standard, and they must be listed right next to their links, in the same \*.txt (a sketch of this output contract follows these notes).

Steps 2-4 will be coded by Johannes, starting with downloading the files from the list of PDF links. In Step 2 we need to decide whether a journal is behind a paywall and how to get around it: we might need to run the scraping through a VPN, or build the paywall credentials into the scrapers (see the download sketch below).

There will be a general config file with the following options for each journal (see the config sketch below):

- Name
- URL
- Persons to notify
- Paywall (boolean parameter)
- Processing schedule (e.g. daily or weekly)
- File name of the scraper script to be used

The notifications will be triggered at three different moments (see the notification sketch below):

1. When a scraper is run, send the log, but only if there was an error to report; otherwise we'd just generate too much spam.
2. When a scraper is run, send a message with the number of PDFs downloaded (maybe also too much spam?).
3. When PDFs have been batch-processed, send a notification.

A repository will be created to host all these scrapers and the other scripts. Johannes and Marcus will be the admins of this repo. It shall be created under Plazi's GitHub account.

> We have to add Johannes to the Plazi organization on GitHub. -> Done

Ideas / improvements after a night of sleep (and some talking to Guido):

Scheduling etc. will be done by Guido's infrastructure, as he has already implemented scheduling backends. Johannes will build the component that integrates the scrapers into that environment and takes care of VPN/tunnel usage etc.
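For reference, a minimal sketch of the two-file output contract from Step 1. The file names, the tab-separated `URL<TAB>target-name` line format, and the `write_scrape_results` helper are illustrative assumptions, not decided standards:

```python
from datetime import datetime

def write_scrape_results(links, journal="example-journal"):
    """links: list of (pdf_url, standardized_file_name) tuples."""
    stamp = datetime.now().strftime("%Y%m%d")
    # File 1: the links plus the standardized names, side by side on each line.
    with open(f"{journal}_{stamp}_links.txt", "w", encoding="utf-8") as fh:
        for url, name in links:
            fh.write(f"{url}\t{name}\n")
    # File 2: the log of the scraping run.
    with open(f"{journal}_{stamp}_log.txt", "w", encoding="utf-8") as fh:
        fh.write(f"{datetime.now().isoformat()} scraped {len(links)} links\n")

write_scrape_results([
    ("https://example.org/article/123.pdf", "examplejournal_2020_123.pdf"),
])
```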
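For Step 2, a rough sketch of a downloader that consumes the links file. It uses only the standard library; the minimal error handling (collecting failures so they can feed the error notification) is an assumption, and real paywall/VPN handling would sit around this:

```python
import os
import urllib.request

def download_from_links_file(links_file, out_dir="."):
    """Read `URL<TAB>target-name` lines and fetch each PDF under its new name."""
    failures = []
    with open(links_file, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            url, name = line.rstrip("\n").split("\t")
            try:
                urllib.request.urlretrieve(url, os.path.join(out_dir, name))
            except OSError as exc:  # covers urllib.error.URLError
                failures.append((url, str(exc)))
    return failures  # could feed the error notification (trigger 1 below)
```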
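One way the per-journal config could look, sketched here as a Python structure; the field names and values are illustrative assumptions, and the actual file format (YAML, INI, ...) was not decided in the meeting:

```python
JOURNALS = [
    {
        "name": "Example Journal of Zoology",     # Name
        "url": "https://example.org/journal",     # URL
        "notify": ["marcus@example.org",
                   "johannes@example.org"],       # Persons to notify
        "paywall": False,                         # Paywall (boolean)
        "schedule": "weekly",                     # Processing schedule
        "scraper": "example_journal_scraper.py",  # scraper script to be used
    },
]
```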
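And a hedged sketch of the three notification triggers; `send_mail` is a hypothetical placeholder, not an agreed interface, and whether trigger 2 stays in at all is still open:

```python
def send_mail(recipients, subject, body):
    # Hypothetical stand-in; the real channel (email, chat, ...) is undecided.
    print(f"to={recipients} subject={subject}\n{body}")

def notify_after_scrape(cfg, log_text, had_error, n_downloaded):
    if had_error:  # trigger 1: send the log only when something went wrong
        send_mail(cfg["notify"], f"[scraper] {cfg['name']} reported errors",
                  log_text)
    if n_downloaded:  # trigger 2: report how many PDFs were downloaded
        send_mail(cfg["notify"], f"[scraper] {cfg['name']}",
                  f"{n_downloaded} PDFs downloaded")

def notify_after_batch(cfg, processed_files):
    # trigger 3: PDFs went through the batch-processing pipeline
    send_mail(cfg["notify"], f"[batch] {cfg['name']}",
              "\n".join(processed_files))
```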