# Blarchive - The Future of Internet Archival
This document covers the expected behavior of the blockchain archival application for development purposes. While attempts are made to make this document exhaustive, https://archive.is can serve as a point of reference: this application's main features will essentially mimic that existing application, with the added integrity of hosting the text on the Steem blockchain and options for different archival methods.
## Overview
A. User submits a link to the site. The site checks for previous archives of the link and, if one exists, provides it to the user along with the option to take a new snapshot.
B. The link is submitted to archive.is and the page is scraped locally.
C. The byte counts of the images/video/audio in the local scrape are totaled and multiplied by a rate needed to sustain hosting.
D. The user receives an archive.is link and 3 options depending on what they want archived:
* Only text for X amount
* Text and images for Y amount
* Text, images, video, and audio for Z amount
E. User can cycle through the 3 options to see what best suits their needs, with the UI reading from the local scrape cache to display each option in a "what you see is what you get" fashion
F. After choosing the option that suits their needs, Keychain or SteemConnect will be used to process the payment, with the alternative of sending an exact amount with a custom memo to a designated account.
G. Once payment has been received (if applicable), the archive in the local cache should be copied into one or more posts on the Steem blockchain, and a link to the archived page should be provided to the user.
## Further details of each step
A. Pretty self-explanatory, but if a link has been archived more than once, a timeline of archives ordered by date should be provided. Also, going to http://blockchainarchive.io/https://twitter.com/User should load all the archives of that URL.
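As a rough sketch of that path-based lookup in TypeScript (the `ArchiveRecord` shape and `findArchives` store are hypothetical placeholders, not an existing API):

```ts
// Sketch of the path-based lookup, e.g. a request for
// /https://twitter.com/User returns every archive of that URL.
interface ArchiveRecord {
  targetUrl: string;     // the URL that was archived
  chainPermlink: string; // where the archive lives on the chain
  archivedAt: Date;
}

// Hypothetical store of prior archives; a real implementation
// would query a database or the chain itself.
declare function findArchives(targetUrl: string): ArchiveRecord[];

// Everything after the site's own origin is treated as the target URL.
function extractTargetUrl(requestPath: string): string | null {
  const raw = requestPath.replace(/^\//, "");
  try {
    return new URL(raw).toString(); // also normalizes scheme/host casing
  } catch {
    return null; // not a valid absolute URL
  }
}

// Newest first, so the UI can render a date-ordered timeline.
function lookupTimeline(requestPath: string): ArchiveRecord[] {
  const target = extractTargetUrl(requestPath);
  if (target === null) return [];
  return findArchives(target).sort(
    (a, b) => b.archivedAt.getTime() - a.archivedAt.getTime(),
  );
}
```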
B. Archive.is archiving automation may or may not be possible; we're still waiting on word from the archive.is webmaster.
The local scrape should save:
- Text content of the web page.
- Images.
- Content of the frames.
- Content and images loaded or generated by JavaScript on Web 2.0 sites.
- Video and audio (if possible).
Flash, ads, and trackers should be ignored by default, but a page may need them in order to display correctly, so the option to include them should be available when the user is looking at the finished archive, e.g. *If the page doesn't appear to have been saved correctly, try again with ads and trackers included {button to try again}*. We may also need to set a max size, but since users are paying for hosting, that may not be needed.
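One plausible way to implement the local scrape with the default ad/tracker skipping is request interception in a headless browser. A minimal sketch using Puppeteer, where the blocklist is a hypothetical stand-in for a real ad/tracker list and the `includeAds` flag covers the retry case above:

```ts
import puppeteer from "puppeteer";

// Hypothetical stand-in for a real ad/tracker blocklist.
const BLOCKED_HOSTS = ["doubleclick.net", "google-analytics.com"];

async function scrapePage(url: string, includeAds = false): Promise<string> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Skip ads/trackers by default; the "try again" button would
  // re-run this with includeAds = true.
  await page.setRequestInterception(true);
  page.on("request", (req) => {
    const host = new URL(req.url()).hostname;
    if (!includeAds && BLOCKED_HOSTS.some((h) => host.endsWith(h))) {
      req.abort();
    } else {
      req.continue();
    }
  });

  // networkidle0 waits for content loaded/generated by JavaScript.
  await page.goto(url, { waitUntil: "networkidle0" });

  // Serialized DOM of the main frame; sub-frames, images, video,
  // and audio would need to be captured separately.
  const html = await page.content();
  await browser.close();
  return html;
}
```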
C.-F. The three options for archival should be easily navigable, possibly with tabs at the top of the screen for the user to preview. Each one will be priced according to its data size, multiplied by a configurable rate set in a settings file or menu. Getting the pricing right while still being sustainable will require some tweaking, so the rate needs to be changeable on a whim. The price per byte will be calculated for each tab, and the user can easily use SteemConnect or Keychain to pay for the archival on the page after reviewing.
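For illustration, the per-tier pricing might reduce to something like the sketch below; the `ScrapeSizes` shape and the rate value are assumptions, with the rate intended to live in that settings file:

```ts
// Byte counts measured from the local scrape cache.
interface ScrapeSizes {
  textBytes: number;
  imageBytes: number;
  mediaBytes: number; // video + audio
}

// Hypothetical rate; in practice read from the settings file or
// menu so it can be changed on a whim.
const PRICE_PER_BYTE = 0.000001; // in STEEM

// Each tier pays for everything it includes, so the totals nest.
function tierPrices(sizes: ScrapeSizes, rate = PRICE_PER_BYTE) {
  const text = sizes.textBytes;
  const textImages = text + sizes.imageBytes;
  const everything = textImages + sizes.mediaBytes;
  return {
    textOnly: text * rate,            // option X
    textAndImages: textImages * rate, // option Y
    full: everything * rate,          // option Z
  };
}
```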
G. There should be a payment-confirmed notification/pop-up once the payment has gone through, then a loading screen while the transfer to the chain takes place. Once the entire archive is on the chain, the page should redirect to the on-chain archive the user paid for. There should also be a "report a problem" button somewhere on the page in case they don't get what they paid for.
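For the exact-amount-plus-custom-memo path, payment confirmation could reduce to matching an incoming transfer. A sketch, assuming the transfer operations have already been fetched from the archive account's history (the fetching itself is out of scope here):

```ts
// Shape of a transfer as it might arrive from account history.
interface Transfer {
  from: string;
  to: string;
  amount: string; // e.g. "1.234 STEEM"
  memo: string;
}

// Payment counts as confirmed once a transfer to the archive
// account matches the exact amount and the custom memo issued
// for this archival job.
function paymentConfirmed(
  transfers: Transfer[],
  archiveAccount: string,
  expectedAmount: string,
  expectedMemo: string,
): boolean {
  return transfers.some(
    (t) =>
      t.to === archiveAccount &&
      t.amount === expectedAmount &&
      t.memo === expectedMemo,
  );
}
```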
## Other misc things
- Need a way for admins to block IPs to mitigate abuse
- Should probably have a strict "no porn" policy and block known porn websites from being archived
- Preferably use the OpenWayback code as a base to minimize exploits and dev overhead in the future
- With Steem's 64 KB post limit, some sites will need to be split across multiple posts but still displayed as one on the front end (see the chunking sketch after this list)
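A minimal sketch of that splitting, measuring UTF-8 bytes rather than characters so multi-byte text doesn't overflow the limit (the part-linking scheme in the closing comment is an assumption):

```ts
const MAX_POST_BYTES = 64 * 1024; // Steem's per-post body limit

// Split archive content into chunks that each fit in one post.
// This splits on code-point boundaries; a real implementation
// would prefer tag boundaries so each chunk parses cleanly.
function splitForChain(content: string, maxBytes = MAX_POST_BYTES): string[] {
  const enc = new TextEncoder();
  const chunks: string[] = [];
  let current = "";
  let currentBytes = 0;
  for (const ch of content) { // for...of keeps surrogate pairs together
    const chBytes = enc.encode(ch).length;
    if (currentBytes + chBytes > maxBytes && current !== "") {
      chunks.push(current);
      current = "";
      currentBytes = 0;
    }
    current += ch;
    currentBytes += chBytes;
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// The front end would stitch the parts back together, e.g. by
// storing "part i of n" plus the next part's permlink in each
// post's json_metadata and concatenating on render.
```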