# Linux Kernel Design 2025: Discord Clouding v0.1 review

> by < `RealBigMickey` >
> Link to new project: **[Disfs](https://hackmd.io/yyvF_l0tSAa8U-uMczQ6zg)**

A review and breakdown of a project I made exactly 1 year ago, **"Discord Clouding"**.

`Goal: Recognize the faults and drawbacks of the old before starting anew.`

[Full code on GITHUB (click me!)](https://github.com/RealBigMickey/Freshman-projects/tree/main/.Discord%20clouding%20v0.1)

:::danger
**TL;DR biggest problems:**
- Problematic multi-user support
- Problematic multi-file downloads
- Server crashes every ~30 hrs or so (not a memory leak)
- Uploading files had no progress bar or indicator; for large files the page would seem frozen (when it wasn't and WAS still uploading the file)
- No "real" realtime progress bar
- No way to organize files
- Sockets aren't protected at all
- SQL was never really used properly; most data is handled with plain JSON files
- Lots of poor design choices from the very beginning
:::

## Overview
"Discord clouding" is a website/service that made use of a Discord bot and the permanent nature of message attachments, creating a service similar to ['Gdrive'](https://drive.google.com/drive/my-drive) or ['Dropbox'](https://www.dropbox.com/).

**This'll go through:**
1. Purpose & Roles: What each file is roughly responsible for.
2. How everything works: A summary of each file's internal logic or contents.
3. Communication: How it interacts with other parts of the system (Flask server, Discord bot, the database, etc.).
4. Data Structures & Algorithms: Data structures or algorithms used in my code or in the libraries called (e.g. Flask, SQLAlchemy, Discord.py, etc.).

**Project structure:**
```vb
└─.Discord clouding v0.1
  └─ Bot
    └─ __pycache__
    └─ file_logs
    └─ venv
    └─ .env
    └─ chunks_to_file.py
    └─ convert_to_chunks.py
    └─ doggo coding.gif
    └─ main.py
    └─ responses.py
  └─ downloads
  └─ uploads
  └─ Website
    └─ instance
      └─ User_database.db
    └─ website
      └─ __pycache__
      └─ static
      └─ templates
      └─ auth.py
      └─ cleanup.py
      └─ models.py
      └─ views.py
    └─ download_queue.json
    └─ main.py
    └─ queue.json
    └─ ready_download.json
  └─ README.md
  └─ ready_download.json
```

![image](https://hackmd.io/_uploads/HyHrzQ2TJx.png)
> Very basic structure diagram in Chinese.

:::info
NOTE: The Discord bot will be referred to as José (the butler)
:::

---

## Bot Directory

### 1. `Bot/.env`
- **Role**: Stores `DISCORD_TOKEN` for bot authentication.

### 2. `Bot/convert_to_chunks.py`
- **Role**: Split large files into fixed-size chunks (`BYTE_LIMIT`, just under 1 MB in the snippet below) for Discord attachments. Writes chunks as 0001, 0002, ... and so on.
- **Key Snippet**:
```python
BYTE_LIMIT = 1024 * 1024 - 100  # just under 1 MiB per chunk, safely below Discord's attachment limit

def read_in_chunks(file):
    while True:
        data = file.read(BYTE_LIMIT)
        if not data:
            break
        yield data

def write_file_in_chunks(f, temp_dir, file_name, file_path):
    total = int(os.path.getsize(file_path) / BYTE_LIMIT) + 1
    folder = os.path.join(temp_dir, file_name)
    os.makedirs(folder, exist_ok=True)
    i = 1
    for chunk in read_in_chunks(f):
        name = f"{i:04}.json"
        with open(os.path.join(folder, name), "wb") as f2:
            f2.write(chunk)
        i += 1
    f.close()
    return total
```

:::warning
**Problem❗**
A temporary folder is created as "filename", with chunks stored as indexed .json files. But filenames AREN'T unique, so if a user downloads two files with the same name there's a collision: the chunk files overwrite each other, breaking the system.
-> Use a unique file ID as the temp directory name.
BUT unique file IDs are based on the user, so if multiple users request downloads at the same time it would also lead to collisions.
-> Use `"USER_ID:FILE_ID"` as the temp directory name (sketched below).
:::
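A minimal sketch of that fix, assuming the caller knows the requesting user's DB ID and the file's DB ID (the helper name `chunk_dir_for` is made up for illustration):

```python
import os

def chunk_dir_for(temp_root: str, user_id: int, file_id: int) -> str:
    """Build a per-request temp directory that can't collide across users
    or across files that happen to share a filename."""
    # A "USER_ID:FILE_ID"-style key; '_' is used instead of ':' since ':'
    # isn't a legal path character on Windows.
    path = os.path.join(temp_root, f"{user_id}_{file_id}")
    os.makedirs(path, exist_ok=True)
    return path

# e.g. write_file_in_chunks() would write into chunk_dir_for("temp_upload", 12, 7)
```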
### 3. `Bot/chunks_to_file.py`
- **Role:** Reassemble chunk files into one output.
- **Key Snippet:**
```python
def reassemble_file(output, chunk_folder, ori_name):
    with open(output, "wb") as out_f:
        for fn in sorted(os.listdir(os.path.join(chunk_folder, ori_name))):
            with open(os.path.join(chunk_folder, ori_name, fn), "rb") as f:
                out_f.write(f.read())
```
- **Data Structures:** Sorting chunk filenames with Timsort -> O(n log n)

### 4. `Bot/responses.py`
- **Role:** Provides chatbot replies for non-file messages.
- **Communication:** Called by main.py in on_message.
- **Data structure:** Lower-cases the message string then does simple string compares -> O(n) (shouldn't matter)

### 5. `Bot/file_logs/file_log1.json`
I originally planned to rotate around 3 servers and spread files across them evenly, but this was never implemented lol
- **Role:** Map fileID to [ originalFileName, msgID1, msgID2, … ].
```json
// Example:
{ "7": ["test.pdf", (18-digit number), (18-digit number)] }
```
Entries are written by process_queue() and read by process_download_queue().
- **Data Structures:** JSON dict with string keys, loaded as Python dict[str, list].

:::warning
**Problem❗**
Using JSON to log message IDs back-to-back locally is a terrible way to do things, especially as JSON:
- JSON is verbose (e.g. large numbers are stored as text instead of a compact `unsigned long long`)
- As large files are uploaded, lists of thousands of integers lead to large file sizes for file_log.json, causing:
  - wasted local storage
  - slow reads and writes to the local hard drive

The current way of logging files also isn't safe for concurrent read/write. No locks nor atomic ops were implemented.

Solution -> Make use of SQL. Link message_ID by User_ID and a unique file_ID
> perhaps upload the local SQL database to the cloud as a backup every so often
:::

### 6. `Bot/main.py`
The meat and bones of José.
- **Role:** Main entrypoint, runs a Flask server on port 2121 & the Discord client.
- **Communication:**
  - Done by sending data to, and listening for, HTTP POST requests
  - Web → Bot: POST `/notify-bot` triggers process_queue() or process_download_queue().
  - Bot → Web: Calls POST `/notify-server` with JSON {'progress':…}, {'upload_complete':…}, or {'downloaded':…}.
  - Bot ↔ Discord: Uses the `discord` library to send/fetch messages (upload/download chunks).

**Discord Setup:**
```python
load_dotenv()
client = dc.Client(intents=dc.Intents.default())

# Flask app to receive notifications from the Web server:
app = Flask(__name__)

@app.route('/notify-bot', methods=['POST'])
def notify_upload():
    data = request.get_json()
    if data.get('process_queue'):
        with open(QUEUE_PATH) as f:
            queue = json.load(f)
        client.loop.create_task(process_queue(queue))
        return 'Upload notification received', 222
    elif data.get('process_download_queue'):
        with open(DOWNLOAD_QUEUE_PATH) as f:
            dq = json.load(f)
        client.loop.create_task(process_download_queue(dq))
        return 'Download notification received', 222
    return 'Invalid request', 400
```

:::warning
**Problem❗**
Communication between the bot and server:
- Used HTTP, which is a text-based protocol
- This was hitting a URL exposed by a web server
- Malicious users could easily send false info or flood the channel

It worked because HTTP runs over sockets, but it's slower, heavier and isn't real-time. Just using sockets would've been better.
Also, there are only 2 signals the receiver checks for, yet entire strings of data are sent when a single byte would've been MORE than enough (see the sketch below).
:::
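To make the "just send a few bytes over a socket" idea concrete, here's a minimal sketch (not code from the project) that packs a notification into 6 bytes with the standard `struct` module; the message layout, port and helper names are my own assumptions:

```python
import socket
import struct

# Hypothetical message types for bot -> web notifications
MSG_PROGRESS, MSG_UPLOAD_DONE, MSG_DOWNLOAD_DONE = 1, 2, 3

def pack_signal(msg_type: int, file_id: int, percent: int = 0) -> bytes:
    # 1 byte type + 4 byte file ID + 1 byte progress percentage = 6 bytes total,
    # versus a full HTTP request with headers and a JSON body.
    return struct.pack("!BIB", msg_type, file_id, percent)

def send_signal(host: str, port: int, payload: bytes) -> None:
    with socket.create_connection((host, port)) as s:
        s.sendall(payload)

# e.g. report that file 7 is 42% uploaded:
# send_signal("127.0.0.1", 2122, pack_signal(MSG_PROGRESS, 7, 42))
```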
**Async Functions** (signals to the web server)**:**
- notify_server_progress(ID, text):
  - requests.post(NOTI_SERVER_URL, json={'progress':[ID,text]}) inside run_in_executor.
  - Signal for current chunk progress, e.g. `31/42 downloaded`
- notify_server_uploaded(ID, file_name):
  - requests.post(..., json={'upload_complete':ID}).
  - Signal for upload completion
- notify_server_downloaded(ID, file_name):
  - requests.post(..., json={'downloaded':[ID,file_name]}).
  - Signal for local download completion (ready to be sent to the user)

:::warning
**Problem❗**
Once again, the signals can be much shorter:
- 1 byte for the type
- 4 bytes for the ID (int)
- 1 byte for the progress (change to an integer % display)
:::

**process_queue(queue):**
```python
# For each fileName → ID in queue:
#   Split with convert_to_chunks.write_file_in_chunks(...).
#   Upload each chunk:
folder = os.path.join("temp_upload", file_name)
for chunk in sorted(os.listdir(folder)):
    num = int(chunk[:-5])
    await notify_server_progress(ID, f"{num}/{total}")
    with open(os.path.join(folder, chunk), "rb") as f2:
        msg = await channel.send("", file=dc.File(f2, f"{file_name}#{num}"))
    message_ids.append(msg.id)
```
- Prepend file_name to message_ids, write file_log1.json.
- await notify_server_uploaded(ID, file_name).
- Delete local files and clear queue.json (via aiofiles).

:::warning
**Problem❗**
José sends a message on the server for every single chunk. Doing this in production would flood the Discord channel with messages; it should only be used for DEBUGGING.
:::

**refresh_file_logs():**
- To get the latest version of the file_logs, this function is called and every `file_log{i}.json` is read again.
```python
# e.g.
path1 = os.path.join("Bot", "file_logs", "file_log1.json")
with open(path1, 'r') as f:
    file_log1 = json.load(f)
```

:::warning
**Problem❗**
- I/O operations are costly; this function is called for every single file, and each call needs to read 3 external JSON files
- There's no concurrency control. No mutex locks to prevent data races, causing errors in the logs
- As the file logs grow large, it gets more and more costly to load & write them:
  - Full read into memory: the JSON has to be fully deserialized. O(n).
  - Full write back to disk: the whole file is rewritten. O(n).
:::

**process_download_queue(dq):**
- The download queue is ALSO a JSON file
  - read, load, write operations at O(n)
- Calls `refresh_file_logs()` (the ↑ function)
- Linear scan across every file_log to find the correct one, then calls `download_from_message_id(file_log[id], id)` to finally handle the download

:::warning
**Problem❗**
- Which file_log a file's message data is stored in could be decided by a function, like a hash table does:
  - instead of scanning through every file_log we could get the exact file_log by running a hash function (see the sketch below)
- Same problem as before: constant reads and writes to an external file are costly, and concurrency is a mess.
:::
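A minimal sketch of that hashing idea, assuming the same three `file_log{i}.json` files; the helper name `pick_file_log` is made up:

```python
NUM_FILE_LOGS = 3  # file_log1.json .. file_log3.json

def pick_file_log(file_id: int) -> str:
    """Map a file ID straight to the log it lives in, instead of
    scanning all three logs linearly."""
    index = file_id % NUM_FILE_LOGS + 1  # deterministic "hash" of the ID
    return f"file_log{index}.json"

# e.g. file 7 always lands in file_log2.json, both when writing after an
# upload and when looking it up again for a download.
```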
**download_from_message_id(id_list, idStr):**
- id_list = ["my.pdf", msgID1, msgID2, …].
- Create `temp_download/<fileName>/`.
- For each msgID:
```python
msg = await channel.fetch_message(msgID)
attachment = msg.attachments[0]
path = os.path.join("temp_download", fileName, attachment.filename)
await notify_server_progress(idStr, f"{i}/{N}")
await attachment.save(path)
```
- reassemble_file("Website/downloads/<fileName>", "temp_download", fileName).
- await notify_server_downloaded(idStr, fileName).
- Fetches the corresponding chunk from each msgID and downloads the data to a local directory.

:::warning
**Problem❗**
- id_list.pop(0) mutates the caller's data structure, which works here BUT is dangerous if id_list is shared state
- No concurrency logic:
  - Unsafe when multiple concurrent download_from_message_id calls run.
  - Not protected by an asyncio.Lock.
  - shutil.rmtree(PATH_TO_CHUNK_FOLDER): concurrent calls may overwrite and delete each other's temp data.
:::

**Discord Event Handlers:**
```python
@client.event
async def on_ready():
    print(f"{client.user} is online!")
    await client.get_channel(NOTIFICATION_CHANNEL_ID).send("Bot online")

@client.event
async def on_message(message):
    if message.author == client.user:
        return
    if message.content.startswith('!dog'):
        with open("doggo coding.gif", "rb") as f:
            await message.channel.send("", file=dc.File(f, "doggo coding.gif"))
    else:
        await send_message(message, message.content)
```

**Run Both Servers:**
```python
def run_flask():
    app.run(host='0.0.0.0', debug=False, port=2121)

def main():
    threading.Thread(target=run_flask).start()
    client.run(token=TOKEN)

if __name__ == "__main__":
    main()
```

### 7. `uploads/` & `downloads/` (Empty Folders)
uploads/:
- Place to store incoming user-uploaded files before the Bot splits them.
- The Bot deletes these files after chunking.

downloads/:
- Final destination for reassembled files after the Bot downloads and joins the chunks.
- Served to the user via Flask's send_file.

## Website Directory
The part of the code that serves the website, handles user requests and communicates with the Discord bot.

### 1. Website/main.py
- Role: Entrypoint for the Flask app (port 1212).
- Key Snippet:
```python
app = Flask(__name__)
app.config['SECRET_KEY'] = 'secret'
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///instance/User_database.db'
db = SQLAlchemy(app)

login_manager = LoginManager()
login_manager.login_view = 'auth.login'
login_manager.init_app(app)

from website.models import User

@login_manager.user_loader
def load_user(user_id):
    return User.query.get(int(user_id))

from website.auth import auth as auth_bp
app.register_blueprint(auth_bp)
from website.views import views as views_bp
app.register_blueprint(views_bp)

if __name__ == "__main__":
    app.run(host='0.0.0.0', debug=True, port=1212)
```
- Communication:
  - Listens for user HTTP requests (login, signup, upload, download, etc.), and Bot callbacks at /notify-server.
- Data Structures & Algorithms:
  - Flask: URL routing, request/response.
  - Flask-SQLAlchemy: ORM, B-tree indices in SQLite.
  - Flask-Login: Session cookie management.

### 2. Website/instance/User_database.db
- Role: SQLite database for the User, File, (and Note) tables.
```sql
-- Schema:
CREATE TABLE user (
    id INTEGER PRIMARY KEY,
    email TEXT UNIQUE,
    password TEXT,
    Name TEXT
);
CREATE TABLE file (
    id INTEGER PRIMARY KEY,
    file_name TEXT,
    progress TEXT DEFAULT 'uploading...',
    download_ready INTEGER DEFAULT 0,
    upload_ready INTEGER DEFAULT 0,
    user_id INTEGER REFERENCES user(id),
    file_path TEXT,
    date DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE note (
    id INTEGER PRIMARY KEY,
    data TEXT,
    date DATETIME DEFAULT CURRENT_TIMESTAMP,
    user_id INTEGER REFERENCES user(id)
);
```
- Algorithms:
  - SQLite B-tree indexing, SQL queries, SQLAlchemy ORM.

:::warning
**Problem❗**
- The progress indicator is a string that's literally stored in the SQL db
  - Waste of storage; it could just be an 8-bit integer, or better yet be excluded from the database altogether.
- The process in charge of downloading should be able to communicate progress to the website through a socket or other methods
- **"notes"** are defined but never used, something that was scrapped along the way
:::
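Circling back to the "make use of SQL, link message_ID by User_ID and file_ID" suggestion from the file_logs section, here's a sketch of what that table could look like using SQLAlchemy like the rest of the Website code. The `ChunkMessage` model is hypothetical and assumes the same `db` object as `Website/website/models.py`:

```python
# Hypothetical model: one row per uploaded chunk, replacing file_log{i}.json.
class ChunkMessage(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'), index=True)
    file_id = db.Column(db.Integer, db.ForeignKey('file.id'), index=True)
    chunk_index = db.Column(db.Integer)    # 1, 2, 3, ... ordering of the chunks
    message_id = db.Column(db.BigInteger)  # 18-digit Discord message IDs fit in 64 bits

# Reassembling a file becomes one indexed query instead of scanning JSON logs:
# ChunkMessage.query.filter_by(file_id=7).order_by(ChunkMessage.chunk_index).all()
```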
### 3. Website/website/auth.py
- Role:
  - Handles user login, logout, signup via Flask-Login and password hashing.
- Key Snippets:
```python
@auth.route("/login", methods=['GET', 'POST'])
def login():
    if request.method == 'POST':
        user = User.query.filter_by(email=request.form['email']).first()
        if user and check_password_hash(user.password, request.form['password']):
            login_user(user, remember=True)
            return redirect(url_for('views.home'))
        flash("Wrong credentials")
    return render_template("login.html")

@auth.route("/sign-up", methods=['GET', 'POST'])
def sign_up():
    if request.method == 'POST':
        # Validate email, name, passwords match, length
        new_u = User(
            email=email,
            Name=Name,
            password=generate_password_hash(password1, method="pbkdf2:sha256")
        )
        db.session.add(new_u); db.session.commit()
        login_user(new_u, remember=True)
        return redirect(url_for("views.home"))
    return render_template("sign_up.html")
```
- Communication:
  - Interacts with the User model and db.session.
  - Uses flash() for error/success messages.
- Data Structures & Algorithms:
  - generate_password_hash() / check_password_hash(): PBKDF2 with SHA-256.

### 4. Website/website/models.py
- Role: Defines SQLAlchemy models for User, File, and Note.
- Key Snippet:
```python
class File(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    file_name = db.Column(db.String(100))
    progress = db.Column(db.String(100), default="uploading...")
    download_ready = db.Column(db.Integer, default=0)
    upload_ready = db.Column(db.Integer, default=0)
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'))
    file_path = db.Column(db.String(200))
    date = db.Column(db.DateTime(timezone=True), default=func.now())
```

:::warning
**Problem❗**
- As stated above, progress is redundant. Communicating through a class is a concurrency hazard and slow to read and write, albeit to memory rather than an external file (a leaner version is sketched below)
- file_name and file_path don't need to be such massive strings; it's a waste of memory while running
:::
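A sketch of what a leaner File model could look like under those suggestions; the column sizes and the integer progress field are my own guesses (and it assumes the same imports as the original models.py), not the project's code:

```python
class File(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    file_name = db.Column(db.String(255))             # enough for a filename
    file_path = db.Column(db.String(255))
    progress = db.Column(db.SmallInteger, default=0)  # 0-100 %, instead of a free-form string
    download_ready = db.Column(db.Boolean, default=False)
    upload_ready = db.Column(db.Boolean, default=False)
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'), index=True)
    date = db.Column(db.DateTime(timezone=True), default=func.now())
```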
### 5. Website/website/views.py
- Role: Core Flask routes for file upload, deletion, download, and Bot callbacks.
- Globals & Setup:
```python
DISCORD_BOT_URL = f"http://{get_ip()}:2121/notify-bot"
QUEUE_FILE_PATH = "queue.json"
DOWNLOAD_QUEUE_FILE_PATH = "download_queue.json"

ready_download = {}
with open("ready_download.json", 'w') as f:
    json.dump(ready_download, f)

download_requests = {}
download_lock = Lock()

UPLOAD_FOLDER, DOWNLOAD_FOLDER = 'uploads', 'downloads'
os.makedirs(UPLOAD_FOLDER, exist_ok=True)
os.makedirs(DOWNLOAD_FOLDER, exist_ok=True)
PATH_TO_FILE_LOG = os.path.join(parent_dir, "Bot", "file_logs")
```

#### Home & Upload ("/")
```python
@views.route("/", methods=["GET", "POST"])
@login_required
def home():
    if request.method == "POST":
        f = request.files['file']
        if not f or f.filename == '':
            flash("No file selected"); return redirect(request.url)
        path = os.path.join("uploads", f.filename)
        f.save(path)
        new_file = File(file_name=f.filename, file_path=path, user_id=current_user.id)
        db.session.add(new_file); db.session.commit()
        queue = json.load(open(QUEUE_FILE_PATH))
        queue[f.filename] = new_file.id
        json.dump(queue, open(QUEUE_FILE_PATH, "w"))
        try:
            resp = requests.post(DISCORD_BOT_URL, json={'process_queue': True})
            if resp.status_code == 222:
                flash("Upload queued!", "success")
            else:
                flash("Bot didn’t respond OK.", "warning")
        except Exception as e:
            flash(f"Bot notify failed: {e}", "warning")
        return redirect(url_for('views.home'))
    files = File.query.filter_by(user_id=current_user.id).all()
    return render_template("home.html", user=current_user, files=files)
```
- Communication:
  - Writes uploads/<fileName>, updates queue.json, notifies the Bot.
- Data Structures:
  - File row insertion; JSON dict for queue.json; requests.post().

:::warning
**Problem❗**
- os.path.join() is called frequently throughout the project, but the path (like here) is static, so it can just be a constant
- More concurrency issues:
  - Read-modify-write on queue.json without a lock.
  - More JSON nonsense
:::

#### Delete File ("/delete-file")
```python
@views.route("/delete-file", methods=["POST"])
def delete_file():
    data = json.loads(request.data)
    f = File.query.get(data['fileId'])
    if f and f.user_id == current_user.id:
        for logname in ["file_log1.json", "file_log2.json", "file_log3.json"]:
            path = os.path.join(PATH_TO_FILE_LOG, logname)
            logs = json.load(open(path))
            if str(f.id) in logs:
                logs.pop(str(f.id))
                json.dump(logs, open(path, "w"))
        db.session.delete(f); db.session.commit()
    return jsonify({})
```

:::warning
**Problem❗**
- As stated above, use a hash function to decide which file_log index a file is stored in.
- Same concurrency issues with JSON
:::

#### Start Download ("/download-file/<file_id>")
```python
@views.route("/download-file/<int:file_id>")
@login_required
def download_file(file_id):
    f = File.query.get(file_id)
    if f and f.user_id == current_user.id:
        f.progress = "downloading..."; db.session.commit()
        dq = json.load(open(DOWNLOAD_QUEUE_FILE_PATH))
        dq[str(f.id)] = f.file_name
        json.dump(dq, open(DOWNLOAD_QUEUE_FILE_PATH, "w"))
        with download_lock:
            download_requests[f.file_name] = current_user.id
        session['download_file'] = [f.file_name, f.id]
        try:
            resp = requests.post(DISCORD_BOT_URL, json={'process_download_queue': True})
            if resp.status_code == 222:
                print("Bot notified for download")
        except Exception as e:
            flash(f"Notify bot failed: {e}", "error")
        return redirect(url_for('views.home'))
    flash("Permission denied", "error")
    return redirect(url_for('views.home'))
```
- Communication:
  - Updates File.progress in the DB, writes download_queue.json, notifies the Bot.
- Data Structures:
  - JSON dict for the download queue; session cookie for download_file.

:::warning
**Problem❗**
Disregarding the issues already covered above:
- download_requests is a global dict, so its data is lost if the program crashes or restarts
  - Use a database (a sketch follows)
- session['download_file'] -> the session is user-specific, but it's being used to store global state
  Example: two users download the same file. The Bot finishes downloading and notifies the server. session['download_file'] = [file_name, file_id] is set for whichever request last processed /notify-server. Only one user sees the download; the other is stuck on "file not ready".
:::
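A sketch of that "use a database" suggestion: track who asked for which file in a table instead of the `download_requests` dict and the session. The `DownloadRequest` model is hypothetical, not from the project, and assumes the same `db` object as models.py:

```python
# Hypothetical replacement for the global download_requests dict + session state.
class DownloadRequest(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'), index=True)
    file_id = db.Column(db.Integer, db.ForeignKey('file.id'), index=True)
    ready = db.Column(db.Boolean, default=False)

# /download-file would insert a row; the Bot callback would flip `ready`
# for every user who requested that file, so no request is lost:
# DownloadRequest.query.filter_by(file_id=file_id).update({"ready": True})
# db.session.commit()
```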
#### Notify Server ("/notify-server")
```python
@views.route("/notify-server", methods=["POST"])
def notify_server():
    data = request.get_json()
    if data.get('progress'):
        ID, txt = data['progress']
        f = File.query.get(ID)
        if f:
            f.progress = txt; db.session.commit(); return 'OK', 200
        return 'Invalid', 400
    if data.get('upload_complete'):
        ID = data['upload_complete']
        f = File.query.get(ID)
        if f:
            f.upload_ready = 1; db.session.commit(); return 'OK', 200
        return 'Invalid', 400
    if data.get('downloaded'):
        fid_str, fn = data['downloaded']
        ready_download[fid_str] = fn
        json.dump(ready_download, open("ready_download.json", "w"))
        user_id = download_requests.get(fn)
        if user_id:
            session['download_file'] = [fn, int(fid_str)]
        f = File.query.get(fid_str)
        if f:
            f.download_ready = 1; db.session.commit(); return 'OK', 200
        return 'Invalid', 400
    return 'Invalid request', 400
```
- Communication:
  - The Bot calls this to update progress, upload readiness, or download readiness.
  - Updates the File table and ready_download.json.
- Data Structures:
  - JSON dict, DB updates.

#### Send File ("/send_file/<filename>")
```python
@views.route("/send_file/<filename>")
@login_required
def send_file_route(filename):
    return send_file(os.path.join("downloads", filename), as_attachment=True)
```
- Role: Streams the final file to the user.
- Communication:
  - Uses Flask's send_file, which employs streaming I/O.

### 6. Website/website/cleanup.py
- Role: Simple code that deletes any files that aren't part of `ready_download.json`. Runs on a separate thread and checks on a `time.sleep(60)` interval.

:::warning
**Problem❗**
- Same concurrency problems and JSON mess
- Deletion doesn't necessarily need to be done on a separate thread on a timer. It could be done upon new download requests or after a filesize threshold has been reached
- It could also be relegated to other parts of the code that are already running in the background, e.g. "José"
:::

### Static & Templates
- static/: CSS, JS, images for the web UI.
- templates/ (Jinja2 HTML):
  - login.html, sign_up.html: forms for authentication.
  - home.html: upload form + table of files (shows file.progress, "Download" if upload_ready==1, "Delete" button).
  - download.html: link to /send_file/<fileName>.
  - about_me.html, how_to_use*.html, logic.html: static informational pages.

:::warning
**Problem❗**
When originally writing this project, I didn't know how to write Javascript or do realtime updates on a webpage. Everything was basic HTML.
How progress was displayed and updated:
- Caching images and other objects on the page
- Saving the position of the scrollbar
- Refreshing the page automatically every 5s, simulating real-time updates

While cool, this was a temporary workaround that never got fixed (a polling sketch follows this section).
:::

- Data Structures & Algorithms:
  - Jinja2 template rendering compiles templates into bytecode; rendering is O(template_size + context_size).
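As a sketch of what the server side of "real" realtime progress could look like, here's a small JSON endpoint the page could poll with `fetch()` every second or two instead of reloading itself. The route and field names are hypothetical, and it assumes progress is stored as an integer percentage as suggested earlier:

```python
# Hypothetical polling endpoint, reusing the same views blueprint and File model.
@views.route("/progress/<int:file_id>")
@login_required
def progress(file_id):
    f = File.query.get(file_id)
    if not f or f.user_id != current_user.id:
        return jsonify({}), 404
    return jsonify({
        "progress": f.progress,  # e.g. 42
        "upload_ready": bool(f.upload_ready),
        "download_ready": bool(f.download_ready),
    })

# The page would update a progress bar from this JSON response
# instead of refreshing the whole page every 5 seconds.
```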
## Third-Party Libraries & Their Roles
- Flask: WSGI framework for routing and request/response. Internally uses a rule map (e.g., a prefix trie) for URL dispatch.
- Flask-SQLAlchemy / SQLAlchemy: ORM that translates Python model queries into SQL. The SQLite backend uses a B-tree index on primary keys (id); queries like query.get(id) are O(log N).
- Flask-Login: Manages user sessions via secure cookies and the @login_required decorator.
- Werkzeug: Provides FileStorage for request.files and generate_password_hash/check_password_hash (PBKDF2-SHA256).
- discord.py: Async library for Discord. Maintains a WebSocket to the Gateway and uses aiohttp for REST calls. Methods like .send()/.fetch_message() translate to HTTP requests under the hood.
- aiofiles: Asynchronous wrapper around file I/O; uses a thread pool to avoid blocking the asyncio loop.
- requests: Synchronous HTTP library. In the Bot it's wrapped via run_in_executor() to avoid blocking the event loop.
- Python Standard Library: os, json, shutil, socket, threading, etc. JSON parsing/dumping is O(n) in the size of the JSON.

## Basic flow for uploads and downloads
- Upload Flow:
  - Flask: `User uploads` → `save to uploads/` → `insert File row` → `write queue.json` → `POST /notify-bot`
  - Bot: `Reads queue.json` → `splits file into chunks` → `uploads each chunk to Discord` → `updates file_log1.json` → `calls back to /notify-server?upload_complete` → `clears queue.json`
- Download Flow:
  - Flask: `User clicks "Download"` → `set File.progress` → `write download_queue.json` → `store session['download_file']` → `POST /notify-bot`
  - Bot: `Reads download_queue.json` → `looks up file_log1.json for chunk message IDs` → `fetches each via fetch_message()` → `saves into temp_download/` → `reassembles with chunks_to_file.py` → `writes Website/downloads/fileName` → `calls back /notify-server?downloaded` → `prunes download_queue.json`
  - Flask: `/check-download sees ready_download.json` → `renders download.html with link to /send_file/<fileName>` → `user downloads`