# FazzFinancial Platform Engineering Interview Questions - Zak Chen
Candidate: Zak Chen
## 1. Deposit & Withdrawal
Given the following Ruby on Rails program, talk about:
```ruby
class TransactionController < ActionController::Base
  def deposit
    TransactionJob.perform_later(
      account_id: params[:account_id],
      amount: BigDecimal(params[:amount]),
    )
    render json: { "ok" => true }
  end

  def withdrawal
    TransactionJob.perform_later(
      account_id: params[:account_id],
      amount: BigDecimal(params[:amount]) * -1,
    )
    render json: { "ok" => true }
  end
end

class TransactionJob < ActiveJob::Base
  def perform(account_id:, amount:)
    account = Account.find(account_id)
    if account.balance + amount >= 0
      account.balance += amount
      account.save
    else
      send_alert("#{account_id} does not have sufficient credit!") # send to client or other services
    end
    # we should have alerts when the balance goes less than 0
  end
end

# CREATE TABLE `accounts` (
#   `id` INT,
#   `balance` DECIMAL
# );
class Account < ActiveRecord::Base
end
```
### performance issues
On the producer side, the controller only receives deposit and withdrawal requests and publishes jobs to the queue service (Sidekiq), so producer throughput should not be an issue.
On the consumer side, however, `TransactionJob` workers can be blocked by concurrent writes to the same account's balance.
### concurrency issues
When multiple consumer workers try to update the same account, they contend on the row lock for that account's row, so effective concurrency is bounded by the number of distinct accounts being written.
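A minimal pure-Ruby illustration of the read-check-write cycle the row lock protects (a `Mutex` stands in for the database lock; no database is involved):

```ruby
# Ten threads each deposit 1 a hundred times. Without the lock, two
# threads could read the same stale balance and lose updates; with it,
# the read-modify-write cycle is serialized per account.
lock = Mutex.new
balance = 0

deposit = lambda do |amount|
  lock.synchronize do
    current = balance           # read
    balance = current + amount  # write
  end
end

threads = 10.times.map { Thread.new { 100.times { deposit.call(1) } } }
threads.each(&:join)
balance  # => 1000; without the Mutex this can come out lower
```

This is the same serialization the database's row lock gives us, which is why throughput per account is capped.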
To keep the database from becoming the performance and concurrency bottleneck, we could add a caching layer: reads and writes go to the cache as requests arrive, and the dirty balances are flushed to the database periodically (a write-behind cache).
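A sketch of that write-behind idea in plain Ruby (this `BalanceCache` is a hypothetical in-memory helper; a real one would first load current balances from the database and handle crash recovery):

```ruby
# Balances are updated in the cache on every request and written to the
# backing store in batches, taking the database off the hot path.
class BalanceCache
  def initialize(store)
    @store = store  # anything responding to write_balance(id, value)
    @cache = Hash.new(0)
    @dirty = {}
  end

  def apply(account_id, amount)
    @cache[account_id] += amount
    @dirty[account_id] = true
  end

  def flush
    @dirty.each_key { |id| @store.write_balance(id, @cache[id]) }
    @dirty.clear
  end
end
```

`flush` would typically run on a timer; the trade-off is that updates sitting in the cache can be lost if the process dies before the next flush.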
If one account is accessed very frequently within a short window, it can block workers from processing other accounts. To mitigate this we can apply a hash ring over `account_id`, so that requests for the same account are always delivered to the same worker.
Assuming we have 5 workers w1...w5, and the hash ring maps account 1001 to w1, then every request for account 1001 is delivered to w1.
We may tune the hash function to improve the distribution.
When the overall workload grows, we can enlarge the worker pool; when this happens, we need two things:
1. Re-calculate the hash ring for newly arriving requests
2. Re-balance the existing workload across workers (as Cassandra does when nodes join)
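The routing idea can be sketched in plain Ruby (this `HashRing` class is a hypothetical helper, not something Sidekiq provides; virtual replicas smooth the distribution and limit how many keys move when a worker is added):

```ruby
require 'digest'

# Minimal consistent-hash ring: each worker is placed on the ring at
# many replica points; an account id is routed to the first worker
# point at or after its own hash, wrapping around at the end.
class HashRing
  def initialize(workers, replicas: 100)
    @ring = {}
    workers.each do |w|
      replicas.times { |i| @ring[hash_key("#{w}:#{i}")] = w }
    end
    @sorted = @ring.keys.sort
  end

  def worker_for(account_id)
    h = hash_key(account_id.to_s)
    point = @sorted.bsearch { |k| k >= h } || @sorted.first
    @ring[point]
  end

  private

  def hash_key(s)
    Digest::MD5.hexdigest(s).to_i(16)
  end
end

ring = HashRing.new(%w[w1 w2 w3 w4 w5])
ring.worker_for(1001)  # deterministic: always the same worker for 1001
```

Because only the ring points of a new worker change ownership, adding w6 remaps roughly 1/6 of the accounts instead of nearly all of them (as naive `id % worker_count` routing would).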
### user experience issues
Only one field is ever returned by this service (`{"ok": true}`), so when `.perform_later` fails, the client has no visibility of the error and therefore no chance to recover.
Also, contacting tech support is not helpful, since the service does not leave any logs.
So to improve, we should check the errors returned from these calls and do:
1. Log the error with a timestamp and the affected entity
2. Respond with an appropriate HTTP status code instead of always rendering the same JSON object: 200 on success, 400/404 when the input is invalid or the account does not exist, and 500 for any other unexpected error
3. Alert service
This service receives failure events from the worker/consumer and, based on the account and its own configuration, sends a message via email, SMS, or app notification; the message may also include a link back to the app or web page.
4. Trade recovery service
When a transaction request fails, we may forward it to a trade recovery service that defines different ways to recover the failed process, such as granting extra credit to long-standing customers or simply retrying the step when the others succeeded.
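Points 1 and 2 can be sketched without Rails (this `deposit_response` function and the `ACCOUNTS` table are stand-ins for the real controller and database; a real version would enqueue the job with the parsed amount):

```ruby
require 'json'
require 'bigdecimal'
require 'logger'

ACCOUNTS = { 1001 => BigDecimal("50.0") }  # stand-in for the accounts table
LOGGER   = Logger.new($stdout)

# Validate the request, log failures, and return an HTTP status plus a
# JSON body instead of an unconditional {"ok": true}.
def deposit_response(account_id, raw_amount)
  amount = BigDecimal(raw_amount)  # parse up front so bad input fails fast
  return [404, { "error" => "account not found" }.to_json] unless ACCOUNTS.key?(account_id)
  [200, { "ok" => true }.to_json]
rescue ArgumentError => e
  LOGGER.error("deposit account=#{account_id} amount=#{raw_amount.inspect}: #{e.message}")
  [400, { "error" => "invalid amount" }.to_json]
end

deposit_response(1001, "10.0")          # => [200, '{"ok":true}']
deposit_response(9999, "10.0")          # 404: unknown account
deposit_response(1001, "not-a-number")  # 400, and the failure is logged
```

The key change from the original controller is that the status code carries the outcome, so the client can distinguish "accepted" from "bad request" from "not found" without parsing a custom body.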
## 2. Distributed Transaction
> A distributed transaction is a set of operations on data that is performed across two or more databases.
Create a microservice supporting distributed transaction and the following operations:
1. deposit(account_id:, amount:)
2. withdrawal(account_id:, amount:)
3. transfer(sender_account_id:, recipient_account_id:, amount:)
Please consider the following topics:
### the tools you would use to create such service
When we do not allow our customers' balance to go below zero, the simple option is to rely on a database service like MySQL or PostgreSQL, whose row locks prevent concurrent writes to the same account, the same as in the previous scenario.
With one database per microservice, we can use the SAGA pattern across several of them: when the chain of service calls breaks at one service, it triggers the rollback process, which propagates backwards, rolling back each previous step until the beginning of the chain.
To achieve that, we have two ways. One is to maintain the API session chain: within the timeout, when one service breaks, it throws an error back to the previous request, which triggers the rollback process inside that service.
The other way is for the service that encountered the error to call a rollback endpoint on the previous service, cascading back to the beginning of the chain until all of them have rolled back their local transactions.
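A plain-Ruby sketch of the compensation idea (no real services involved; `Step` and `run_saga` are hypothetical helpers): each saga step pairs an action with a compensation, and on failure the completed steps are compensated in reverse order.

```ruby
Step = Struct.new(:name, :action, :compensation)

# Run each step's action; if one raises, undo the completed steps in
# reverse order by calling their compensations.
def run_saga(steps)
  done = []
  steps.each do |step|
    step.action.call
    done << step
  end
  :committed
rescue StandardError => e
  done.reverse_each { |s| s.compensation.call }
  [:rolled_back, e.message]
end

# transfer = withdraw from sender, then deposit to recipient; here the
# deposit "service" fails, so the withdrawal is compensated
balances = { "alice" => 100, "bob" => 0 }
steps = [
  Step.new(:withdraw,
           -> { balances["alice"] -= 30 },
           -> { balances["alice"] += 30 }),
  Step.new(:deposit,
           -> { raise "recipient service down" },
           -> { balances["bob"] -= 30 })
]
run_saga(steps)    # => [:rolled_back, "recipient service down"]
balances["alice"]  # => 100 (restored by the compensation)
```

In a real system each action and compensation would be an HTTP call to another service, and the compensation itself can fail, which is exactly the inconsistency discussed below.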
### performance & user experience
In the first proposal (the live session chain), consistency is better, as all compensating actions take place as soon as any error happens, but it won't have high throughput, since every service has to hold its session open for the whole chain.
The second proposal has higher throughput, but we have to deal with inconsistency when some services encounter a rollback failure.
### maintainability
On the DevOps side, we can set up auto-scaling based on CPU/memory load, or, if we know the usage pattern and most of the traffic comes at a certain time, scale up before it.
By observing the usage pattern, we can add pods/cores to the node pool and combine that with auto-scaling to avoid a system outage.
### development time estimation
We will write story tickets and hold a grooming session: story owners propose a solution, and the other members discuss it on top of the proposal and its acceptance criteria.
Finally, each task under the story gets a story point; when all tasks are done, the story is considered fulfilled.
Given the conditions here, we need to design a transaction management system, so we can start with a design/survey ticket: the story owner sorts through candidate solutions, presents them to the team, and the most suitable one is chosen; based on that solution, we break the work down into tasks and follow the grooming process above.
Assuming this system's implementation is 50 points and our past velocity is 20 points per sprint, we can expect to complete the actual implementation in 3 sprints.
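The estimate is just the ceiling of total points over velocity (both numbers here are the hypothetical ones from above):

```ruby
story_points = 50
velocity     = 20  # points the team completed per sprint historically
sprints = (story_points.to_f / velocity).ceil
sprints  # => 3
```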