# Mini Project 3 Answers
## LAZY Fork
Q1. The report asks us to count the frequency of page faults during the execution of COW fork. Does this mean we have to count page faults specific to COW-triggered faults only, or capture all types of page faults?
[KM] Ideally, you should count page faults specific to COW. However, it is fine if you capture all of them, since they will necessarily include the COW-triggered ones anyway.
Q2. While running usertests after implementing COW, all tests other than textwrite pass, although it passed before implementing it. For MP2 though, we were asked to remove textwrite from usertests, since it was faulty. Do we do the same here?
[KM] `textwrite` failing is fine, since it is faulty in xv6 itself. You can either choose to remove it, or leave it as is (we'll ignore it while testing anyways).
Q3. When does a page fault occur: during page allocation, or during page access, i.e. the first time a page is accessed by a program?
[PS] A page fault occurs when a process attempts to access memory that is in its address space but is not currently located (in this case, allocated) in physical memory.
## LAZY Read-Write
Q1. Will the entire input be given at the start of the program, or can input be given after some processing?
[IG] The input will be given beforehand.
Q2. Is the input to be taken from a file?
[IG] Not needed.
Q3. Can we have two requests with the same user ID?
[IG] Yes.
Q4. Suppose we have a concurrency limit of 5, and at time t = 10 there are 3 readers on file f1. At t = 11 there is a delete request, and at t = 12 there is another reader. The delete request is blocked until all the readers leave, but should we allow the reader at t = 12 to read the file with the other readers (since we are below the concurrency limit), or should we delete first?
[IG] If the concurrency limit is not reached, READ should be taken first. After that DELETE can happen.
Q5. If two events happen at t seconds, then does the order in which they are printed matter? For example, LAZY takes up a WRITE request at t=2 and another user makes another request at t=2, can they be printed in any order?
[IG] Yeah it doesn't matter as long as the timestamps are correct.
Q6. Is the given example output correct? User 3 made a request to delete file 2 at 2 seconds and User 5 made a request to read file 2 at 4 seconds, then shouldn't User 3's request be taken up first instead of User 5's request once User 2 completes writing to file 2?
[IG] Since file 2 is still being written to by User 2, the READ request is taken up first while the DELETE is still waiting.
Q7. Can two users write to different files at the same time?
[IG] Yes.
Q8. What does it mean that READing is allowed while simultaneously writing to a file? What will the user read if the file being read is modified by another user WRITing to it?
[IG] The behaviour is similar to the Reader-Writer problem (you can see changes done by a user in real time), but since you are not actually required to implement READ and WRITE, you don't need to worry about it.
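For background, the textbook Reader-Writer pattern that the answer alludes to can be sketched in C as below. This is purely illustrative: the MP's own semantics are looser (a READ may overlap a WRITE), and the function names here are hypothetical.

```c
#include <pthread.h>

// Classic readers-writer pattern (illustrative only; the MP merely simulates
// the operations). The first reader locks out writers; the last reader
// admits them again.
pthread_mutex_t rw_mutex = PTHREAD_MUTEX_INITIALIZER;    // held by a writer or the reader group
pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER; // protects reader_count
int reader_count = 0;

void start_read(void) {
    pthread_mutex_lock(&count_mutex);
    if (++reader_count == 1)
        pthread_mutex_lock(&rw_mutex);   // first reader blocks writers
    pthread_mutex_unlock(&count_mutex);
}

void end_read(void) {
    pthread_mutex_lock(&count_mutex);
    if (--reader_count == 0)
        pthread_mutex_unlock(&rw_mutex); // last reader lets writers in
    pthread_mutex_unlock(&count_mutex);
}

void start_write(void) { pthread_mutex_lock(&rw_mutex); }
void end_write(void)   { pthread_mutex_unlock(&rw_mutex); }
```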
Q9. What would be the expected output for this input:
```
2 4 6
3 2 5
1 1 READ 0
2 1 WRITE 1
3 2 DELETE 2
STOP
```
[IG]
```
LAZY has woken up!
User 1 has made request for performing READ on file 1 at 0 seconds [YELLOW]
LAZY has taken up the request of User 1 at 1 seconds [PINK]
User 2 has made request for performing WRITE on file 1 at 1 seconds [YELLOW]
LAZY has taken up the request of User 2 at 2 seconds [PINK]
User 3 has made request for performing DELETE on file 2 at 2 seconds [YELLOW]
LAZY has taken up the request of User 3 at 3 seconds [PINK]
The request for User 1 was completed at 3 seconds [GREEN]
The request for User 2 was completed at 6 seconds [GREEN]
The request for User 3 was completed at 9 seconds [GREEN]
LAZY has no more pending requests and is going back to sleep!
```
Q10. In the given example, LAZY can take up the request of User 1 at 1 second while User 2 sends a WRITE request at 1 second. Should the order of these two events be the same as in the example, or can they change randomly?
[IG] As long as timestamps are correct, order doesn't matter.
Q11. Do we have to check every second while the thread is waiting to see whether the operation is possible, and print that the user cancelled at `t_k` where `t_k = T + arrival`? Or is it fine if we wait for much more than `T` seconds, and then, once we acquire the required locks, realise that time > T and print cancelled at `t_k` seconds where `t_k = T + arrival + extra`?
[IG] You have to print it at T + arrival.
Q12. The task requirement states the following "Users cancel their requests if LAZY takes more than T seconds (from the time at which users send their request) to start processing." How should we handle the case when a request starts being processed by LAZY at exactly time `T`? According to the example, user 3 sends his request at 2 seconds. Now, user 3 cancels his request at 7 seconds, but the request can start at exactly 7 seconds (`T` = 5).
[IG] No, it cannot. The request cancellation takes precedence in this case.
Q13. In the given test case file 2 is not being read and no one is writing to it at t = 7, so can't we accept request of user 3 at t = 7 seconds?
[IG] Refer Q12.
Q14. Can we get more test cases, please?
[IG] Nope.
Q15. Can we limit the number of requests like take it at max 100?
[IG] You can limit the number of requests, but the limit should be > 1000.
Q16. Can we limit the number of files (can i take the max no.of files to be 100 or something?)
[IG] No.
Q17. Can a user perform two tasks simultaneously? For example, can User 1 read file 1 and write to file 1 at the same time?
[IG] Yes.
Q18. ![image](https://hackmd.io/_uploads/S1Uyfcjlkg.png) What do the colours mean here?
[IG] You have to colour-code your output using the colour given in brackets.
Q19. Does the order in which the output is printed matter? My output is correct, but the results are not in order.
[IG] Refer to Q10
Q20. If the concurrency limit is reached, should the user wait till they get a chance (if the concurrency limit frees up), or should LAZY cancel the request? If they should wait, would the max time be T + T_arrival?
[IG] The user will wait. Max time is T+T_arrival.
Q21. What would be the expected output for this input:
```
2 4 6
3 2 2
1 1 READ 0
2 2 WRITE 1
3 2 DELETE 2
STOP
```
[IG] Figure out yourself.
Q22 and Q23. If an operation (e.g. WRITE) completes at 6s, and another WRITE on the same file arrived at 4s, should the next WRITE start at 6s or 7s?
Essentially, if an operation completes at `t` seconds, can another operation start at `t` seconds on the same file?
[KM] Yes. The only requirement is that LAZY waits 1 second from the time of the request's arrival before it can take that request up. Assuming that another request arrived on the same file before `t` seconds, LAZY should indeed be able to pick it up as soon as some other request on that same file is done processing.
Q24. Can two requests arrive at the same time? If yes, what is the output of the following test case if only 2 concurrent users are allowed?
```
1 1 READ 0
2 1 READ 1
3 1 READ 1
```
Since both 2 and 3 came at the same time, which should we consider? Do we need to consider both?
[IG] You may handle as you like. Just state as assumption.
Q25. Referring to Q16, I don't see why we cannot limit the files. One request can access one file only, so we can limit the number of files to the number of requests.
[IG] You are explicitly provided the number of files in input. There is no need for predefined limit.
Q26. If User 1 requests to delete file 1 at t seconds and LAZY takes up the request at t+1 seconds, and User 2 then requests to read or write file 1 at t+1 seconds, should LAZY decline that request at t+2 seconds? What should be done?
[IG] Decline the request immediately.
Q27. Let a WRITE operation take 4s. User 1 requests to write to file 1 at t seconds, and LAZY takes up the request at t+1 seconds. If User 2 requests to write to the same file (file 1) at t+1 seconds, should LAZY decline the request?
[IG] The request will be delayed, not declined.
Q28. Referring to Q3, if multiple requests have the same user ID, how do we differentiate between requests?
[IG] You don't need to. We will not test such cases.
Q29.
```
1 1 READ 0
2 1 READ 1
```
In this case, what should the output be? Should LAZY start processing both of them at 1 second, assuming `max_concurrent_limit` is greater than two?
[IG] No, the request for user 1 will be processed at 1 second and the request for user 2 will be processed at 2 seconds.
Q30. If the number of files is 2 but a user tries to access file 100, what should happen?
[IG] Decline the request.
Q31.
![image](https://hackmd.io/_uploads/ryoBLzVWkx.png)
Can I make the following assumptions?
1. The requests are given in increasing order of t_i.
2. t_i is like 1, 2, 3, ..., i.e. a request arrives every second.
[IG] Both assumptions are wrong.
Q32. What happens if o_1 is erroneous (i.e. instead of "READ" or "WRITE", the user types in "HELLO")? What should be the output?
[IG] Decline the request. You are free to choose the error message as you like.
Q33. If a single user makes two different read requests on the same file (such that the second read request would execute before the first read request is completed), how is that supposed to be treated? Is the second request supposed to be delayed, declined, etc.? If it is supposed to be taken up immediately without delay, does that count as 2 different users accessing the file in terms of "c" (the maximum number of users that can access a file at a given time)?
[IG] We will not be testing such cases. If you wish to handle this, it would be treated as 2 different users accessing the file.
Q34. Just a further clarification on Q26: every time there's an invalid file access (whether because the file was deleted or because it doesn't exist), should it occur at t seconds (assuming the request was made at t seconds), or can it occur at t+1 seconds when it is supposed to be taken up?
[IG] Request can be declined at t seconds.
Q35. Do we need to use threads for this part or can we do it without threads?
[IG]~~If you can handle concurrency without threads, you are welcome to do so. Do note that you may need to explain your approach in the eval.~~
**UPDATE**: You can only use concepts taught in class.
Q36. Suppose a delete operation is requested at `t=2` and taken up at `t=3`, another request arrives at `t=4`, and the delete operation takes 6s. Will the second request wait for the completion of the delete and display invalid at `t=9`s, or at `t=4` (i.e. as soon as the request can be taken up)?
[IG] If a delete is already underway, decline any other request on the same file.
Q37. (With reference to Q9) Isn't the output incorrect? The delete request shouldn't be taken up until all the read/write requests are completed, right?
[IG] DELETE request cannot be performed on a file if a READ or WRITE is happening on the same file.
Q38. Say a WRITE request completes at t seconds, and a DELETE request and a READ/WRITE request are both ready to be executed at that time. Is it OK to assume that whichever request arrived earlier is given preference? Also, in the same scenario, if both a READ and a DELETE request arrived at (t-1)s, which should be executed at t seconds?
[IG] You can assume that. In case both requests arrive at the same (t-1) seconds, READ should be preferred.
Q39. Say there are two requests which came at the same time, but neither can be executed at the moment because of the concurrency limit (max number of users accessing the file). When they can actually execute, can we assume the order of execution to be random, since we can't determine which request acquired the lock?
[IG] The earlier one should be executed.
Q40. Can we use the `sem_timedwait` function to wait on a semaphore for only a given amount of time and return if it exceeds T?
[IG] ~~No.~~
**UPDATE:** This is allowed
Q41. Is it fine if we imitate the DELETE, READ, or WRITE request by sleeping for the required amount of time? There should be no harm in doing this, right?
[IG] Yes.
Q42. Is it necessary to sleep, or can we just ensure that the output is correct? (E.g. with sleep a query might require 20 seconds; is it fine if, without sleep, the output comes out immediately?)
[IG] The simulation must be in real-time.
Q43. In relation to Q40, can we use `pthread_mutex_timedlock` (which serves the same purpose but for mutexes)? If not, can we use `pthread_mutex_trylock()` to test every second whether the lock can be acquired? We shouldn't wait longer than T for any lock, so there should be some function to let that happen, right?
[IG] Yes you can.
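For reference, a hedged sketch of the `pthread_mutex_timedlock` approach (the wrapper name is hypothetical; like `sem_timedwait`, this call expects an absolute `CLOCK_REALTIME` deadline):

```c
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>
#include <time.h>

// Try to lock `m` within T seconds; returns 1 on success, 0 on timeout.
int lock_with_timeout(pthread_mutex_t *m, int T) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline); // now
    deadline.tv_sec += T;                     // absolute deadline = now + T
    return pthread_mutex_timedlock(m, &deadline) == 0;
}
```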
## LAZY Sort
Q1. Can ID be a String or is it just integer?
[KM] ID is an integer.
Q2. Should we use a constant like `max_threads` and hence implement a basic task queuing system (Add all tasks to a queue and at any given time only `max_threads` amount of threads can be spawned). Or do we assume that we can spawn as many threads as we like (Just spawn all tasks at once for each level of merge).
[KM] You are free to work out the details of implementation; either way is perfectly fine (or you can choose to do something entirely different as well). **However, please document what you're doing, and what the pros/cons of one approach over the other are, as part of your report. The report specifications on the MP3 document have been updated to reflect this.**
Q3. Can we use 1 thread extra for managing everything? Or do you expect us to use locks etc and implement it like a recursive function?
[KM] You can use an extra thread to coordinate things, that's perfectly fine. However, I'm not sure about the latter part of your question, could you please rephrase that?
Q4. Please tell me if I understand the question wrong, but can't the problem be solved without using concurrency concepts (especially without threads and locks)? Is that acceptable?
[KM] Since you're expected to implement a distributed system but you have to run it on your laptop, and considering that each "system" sorting the array is supposed to run parallelly, this problem requires concurrency concepts to prevent race conditions.
Q5. Can we assume that ID's are unique?
[KM] Sure.
Q6. Can we use another sorting algorithm for sorting strings in count sort, or is it mandatory to use count sort for strings (name and timestamp)? Count sort is very efficient for numbers but becomes complex for strings.
[KM] You can do something like hashing the string to generate a unique number for each string, and use those numbers to count sort. Alternatively, you can do something else but ultimately, you have to use count sort.
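One order-preserving mapping along these lines, under the lowercase-only and length-cap assumptions stated elsewhere in this FAQ, treats each name as a base-27 number (digit 0 reserved for "past the end of the string", so a prefix sorts before any extension of it). `MAX_NAME_LEN` and the function name are illustrative choices, not part of the spec:

```c
#include <string.h>
#include <stdint.h>

#define MAX_NAME_LEN 8  // assumed length cap; kept in one place so it is easy to change

// Map a lowercase filename (up to MAX_NAME_LEN chars) to a number whose
// numeric order matches lexicographic order. 'a'..'z' map to 1..26 and
// positions past the end map to 0, so "app" < "apple" holds numerically.
uint64_t name_key(const char *s) {
    size_t len = strlen(s);
    uint64_t key = 0;
    for (size_t i = 0; i < MAX_NAME_LEN; i++) {
        uint64_t digit = (i < len) ? (uint64_t)(s[i] - 'a' + 1) : 0;
        key = key * 27 + digit;
    }
    return key;
}
```

Since 27^8 is about 2.8e11, the key fits comfortably in 64 bits at this length, so no modulo (and hence no collision handling) is needed.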
Q7. a) What does a distributed implementation mean here? Does it mean distributed over different networks, or file systems, or lists of files/folders split across multiple distinct files? How do we ensure that we are testing the distributed nature of things?
If we actually have to simulate files spread across multiple machines or network locations, we would have to implement the networking for this. How do we go about doing that?
b) If we are not supposed to implement networking between different machines, do we partition files into chunks and consider each chunk a different node?
Does the input file look something like this?
```
5
node1 fileA.txt 205 2023-10-02T08:00:00
node2 fileB.txt 207 2023-09-30T10:10:00
node2 fileC.txt 203 2023-10-01T15:20:00
node3 fileD.txt 201 2023-09-29T17:15:00
node3 fileE.txt 204 2023-10-01T12:00:00
ID
```
Or do we use multiple files? One for each node?
Something like this?
```
distributed_system/
├── Node_A.txt
├── Node_B.txt
├── Node_C.txt
└── main_data_file.txt
```
```
main_data_file.txt
50
Node_A fileA.txt 205 2023-10-02T08:00:00
Node_A fileB.txt 207 2023-09-30T10:10:00
Node_B fileD.txt 201 2023-09-29T17:15:00
Node_C fileC.txt 107 2023-10-01T09:15:00
…
ID
```
```
Content for individual nodes
Node_A.txt
fileA.txt 205 2023-10-02T08:00:00
fileB.txt 207 2023-09-30T10:10:00
Node_B.txt
fileD.txt 201 2023-09-29T17:15:00
Node_C.txt
fileC.txt 107 2023-10-01T09:15:00
```
c) Files belonging to different nodes may not have unique names or IDs. What can we assume to be unique for all files? Or are the files only differentiated on the basis of which node they belong to?
[KM] a) You do not need to *actually* test it out on different computers. Essentially, distributed here refers to the actual sorting spread out among multiple processing units -- threads in this case.
b) Up to you to decide how you will achieve this. One way to go is have a coordinator thread which is responsible for partitioning the data and handing it off to different threads, and all the threads perform sorting on their chunk of the data (and you decide how you reconcile the different sorted chunks to return one final sorted array).
c) You can assume that all IDs are unique (across nodes).
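A minimal sketch of that coordinator/worker structure: the worker count, the `Chunk` struct, and the use of `qsort` as a stand-in for the per-node merge/count sort are all assumptions for illustration, not the required design.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NUM_WORKERS 4  // assumed "node" count; not fixed by the spec

typedef struct { int *data; int len; } Chunk;

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

// Worker: sort its own chunk in place (qsort stands in for the real sort).
static void *sort_chunk(void *arg) {
    Chunk *c = arg;
    qsort(c->data, c->len, sizeof(int), cmp_int);
    return NULL;
}

// Coordinator: partition the array, hand one chunk to each thread,
// then reconcile the sorted chunks with a simple k-way merge.
void distributed_sort(int *arr, int n) {
    pthread_t tid[NUM_WORKERS];
    Chunk chunks[NUM_WORKERS];
    int off = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        int len = n / NUM_WORKERS + (i < n % NUM_WORKERS ? 1 : 0);
        chunks[i] = (Chunk){ arr + off, len };
        off += len;
        pthread_create(&tid[i], NULL, sort_chunk, &chunks[i]);
    }
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);

    int *out = malloc(n * sizeof(int));
    int idx[NUM_WORKERS] = {0};
    for (int k = 0; k < n; k++) {          // repeatedly take the smallest chunk head
        int best = -1;
        for (int i = 0; i < NUM_WORKERS; i++)
            if (idx[i] < chunks[i].len &&
                (best < 0 || chunks[i].data[idx[i]] < chunks[best].data[idx[best]]))
                best = i;
        out[k] = chunks[best].data[idx[best]++];
    }
    memcpy(arr, out, n * sizeof(int));
    free(out);
}
```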
Q8. Count sort for strings? How does that work?
[KM] Refer to Q6.
Q9. (Referring to the answer to Q6) If we generate a unique number for each string and do count sort on the numbers, we can't get the sorted list of strings, because for that we must give smaller numbers to lexicographically smaller strings, and for that we need to know the lexicographic order of the strings, which is exactly what we have to find out.
From my understanding, I think strings can't be count sorted, but please correct me if I am wrong.
[PS] Strings can indeed be count sorted. Count sort for strings would involve mapping each string to a number (as suggested in Q6, via a hash function) such that the lexicographic ordering of the strings and of the corresponding numbers is the same. This can be done with a function and does not need an explicit mapping to be stored. The evaluation will use constraints that keep the overflow condition in mind, so you do not need to worry about this explicitly. In case you make any non-trivial assumption, make sure to mention and justify it in your report. (You can assume that the provided string always has only lowercase alphabets.)
Q10. Is there a maximum limit on the range (max - min) of IDs/timestamps in a test case (e.g. 1e5 or similar) so that we can allocate an array of allowable size?
[KM] You can assume such a limit, say, 1e5. But make sure that you can change the limit easily (without having to change it in multiple places) if required, so that we can change it during testing.
Q11. Can we declare a fixed size for filename/timestamp and call count sort once per character position, up to the max length among the given strings? The complexity changes to O(MAXLEN·N). Is this valid, or should I use the procedure specified in Q6?
[KM] For filename, @eNt5z1YeSheYyv6FuWEHXw will answer this. For timestamp, I'm not sure why you would need to do that as you do not need to consider the timestamp as a string in order to sort it :)
[PS] For the string-based filename, your implementation can set an appropriate constraint on the length of the name [refer Q18] and perform the distributed count sort under that constraint. The implementation you suggest is much closer to radix sort than count sort, which is not acceptable. The sort needs to be implemented in a single pass over the entire string.
Q12. Do we need to handle cases where two name strings map to the same number under our hash function? For example, string1 and string2 may map to the same number x. Do we have to handle such cases separately, or can we assume such collisions won't occur?
[PS] You don't need to handle collisions beyond a specified filename string length constraint. However, make sure that this constraint can be easily changed. Another significant property of the hash is that it should maintain lexicographic ordering [refer Q9]. Ideally you are expected to implement (and not assume) a collision-free hash under an appropriate constraint limit.
Q13. The clarification for Q6 stated that we ultimately have to use count sort to sort Names and Timestamps (sorting criteria that are strings) in case the threshold is less than 42. However, can we use multiple passes over each character position, emulating LSD (Least Significant Digit) radix sort, where each character position is treated as a "digit"? Or should we stick to mapping each string to a single integer that requires only a single count sort pass? My implementation of the latter approach restricts the string length to 8, as my hash value is large enough that there is a risk of overflow, while the former approach (emulating radix sort) lets me work with much longer strings.
[PS] A single-pass implementation would suffice as a valid count sort implementation. As far as the constraint on the filename string length is concerned, you can set it as a variable which can be easily modified depending on your implementation [refer Q18]. But the implementation should behave as count sort, not as radix sort.
Q14. So essentially we are not supposed to use MPI or anything like it to implement the distributed system, right?
[KM] No. Use only concurrency primitives, like mentioned in the doc and in Q4.
Q15. Do we have to account for name hash collisions in countsort?
[KM] Same as Q12.
Q16. In reference to Q9 and Q6: we would need a hash function over 128 characters which also preserves ordering, i.e. hash(fileA.txt) < hash(fileB.txt). Also consider that a reasonable max array size is around 10,000 (with, say, 10 buckets inside each slot if you want). When you try to generate these hashes, there will inevitably be overflows, and handling them is not trivial. If you simply take a modulus, place values from the start, and use bucketing, you have to drastically change the count sort algorithm; it becomes extremely complex and is no longer count sort. The true count sort implementation would iterate over the entire search space, i.e. 0 to max(hash(filename)), WITHOUT applying any modulus to the hash; the max would theoretically be 2^(8\*128) if all chars are allowed, or (26+4)^128 with only alphabets plus a few extra chars.
Any submitted implementation will be remote from count sort. Either allow that, or let us know if you have a specific logic/implementation in mind ***in detail***.
[KM] You can impose a constraint on the filename length that is significantly less than 128 chars [refer Q18]. This way, you wouldn't have to apply a modulo. This has been mentioned in Q10.
[PS] To add to the answer above: the provided constraints mention only lowercase alphabets as valid characters, and the filename length constraint can be assumed within an acceptable range. Taking this constraint as 8 [refer Q18], we get a total of 26^8 = 208827064576 possible file names. The implemented hash must then map this space lexicographically, which is possible since it is well within the maximum value representable in C by an `unsigned long long`: 2^64 - 1 = 18446744073709551615. Thus there are multiple possible hash functions that would lead to a normal count sort implementation.
Q17. For the line graph, execution time refers to the CPU time or the wallclock time?
[KM] Execution time refers to the time that the actual count sort/merge sort procedure runs for, and returns the final output. You can either choose to consider the partitioning of the array to be part of this, or you can ignore that. Either way, it is fine, but mention what you're doing in your report.
Q18. Can we assume that our filename is at most 8 characters long?
[KM] Use an `unsigned long long int` instead and see how big of a string you can take before you have to modulo. However, 8 is fine. Just mention this as an assumption in your README, and make it so that this number can be changed easily (perhaps maintain a singular variable somewhere that controls this).
Q19. The hashing technique generates a very large number (if we are to maintain lexicographic order), so can we use something like radix sort or some other sorting algorithm to generate the hash number?
[KM] As mentioned in Q10, Q16 and Q19, you can impose a smaller filename length limit.
Q20. Is there any restriction as to which libraries we are allowed to use? More specifically, is the MPI.h library allowed?
[KM] Do not use the MPI.h library.
Q21. Are we allowed to use a trie for sorting on Name and Timestamp for count sort?
[KM] That would be Trie Sort, which isn't really count sort. So, no.