MPHY0021 Coursework 1 feedback

# MPHY0021 Coursework 1 feedback This document discusses some misunderstandings or areas to improve on that were encountered frequently in the submissions. Although you will receive personalised feedback, go through this list and think about which of these comments may also apply to your own submission. ## General comments - When parts of the code become obsolete, remove them, don't comment them out. This keeps your source code cleaner and easier to read and maintain. Almost no one likes deleting code they have written, but it's generally worth it! - Don't hardcode paths to files that only exist in your own computer. This stops others from using your code. - Put all code that should not run when importing the file behind `if __name__ == "__main__":`. Otherwise, that code will run every time someone imports a function (and if it fails, e.g. because a file is missing, then the import will fail too) - Some of the tasks we requested could be done more easily by using a library like `pandas`, but this was not allowed for this assignment. Keep in mind that using an external library introduces a dependency: others will also need to install the library before they can use your code, and the library's interface or behaviour may change in the future. Consider whether the benefit you gain is worth the cost (it often is, but not always!) ### Checking types - When checking the type of a value, prefer `isinstance(x, int)` to `type(x) == int`. Not only does this catch subclasses, but you can use it for multiple types at once: ```python isinstance(5, (int, float)) # true isinstance([4, 5, 10], (list, tuple)) # true ``` - Avoid using `eval`, which can execute arbitrary code and is a security risk: ```python # Instead of: isinstance(eval(str(line[4])), float) # Consider: try: value = float(line[4]) except ValueError: ... ``` ## Structure - Keep an eye out for code that is repeated. "Fixing" that improves the maintenability. - In Python, unlike e.g. C++ or Java, you generally don't need to define "getter" and "setter" methods. If you want to customise how a field is access, consider [properties](https://stackoverflow.com/questions/6618002/using-property-versus-getters-and-setters). ### Object attribute, variable or global? - There is almost always a better alternative than using global variables in a function or class method. Using global variables can lead to unexpected behaviour when multiple different functions or methods use the same global variables and make it difficult to test and debug code. In general you should use locally scoped variables within functions and methods whenever possible. If you need to share values between class methods, you should use attributes of the current instance using the `self` argument to methods. - However: be wary of storing values in an object’s attributes when you don’t need them. It can be useful if you want to use it between different methods, but it also adds complexity and is a potential source of errors (are you sure that the value hasn’t been modified without your knowledge?). Consider whether it would be better to use a local variable, which would disappear at the end of the method. ## Style - Linters and tools like `pycodestyle` check and format your code for legibility with little effort! - Try to avoid using variables names corresponding to [built-in functions / types](https://docs.python.org/3/library/functions.html) such as `max`, `min`, `tuple`, `list`, `len`, `sum` - using such names both prevents you accessing the built-in within the scope the name is defined and can be potentially confusing to someone else reading your code who expects the name to mean something else. - Be specific with variable names, e.g. "mass_balance_measurements", not "values". - Avoid implementation details in names: prefer "glaciers" to "list_of_glaciers"/"glacier_list" (what if it becomes a dictionary?) - One way of making your code future-proof is to avoid hardcoding specific dates. For example, to check whether a measurement is in the future, you can compare it to the date on which the program runs, which you can get using `datetime.now()` in the standard library, instead of the year 2021. This means that your code is likelier to behave correctly in the future without many changes, making it more sustainable. ## Data loading - In object-oriented programming, keep the responsibilities of each class clear and separate. For example, if a `Glacier` is responsible for holding the information of a glacier, then this data should not be stored directly on the `GlacierCollection` too. Avoid this: ```python class GlacierCollection: def __init__(self, ...): self.latitudes = ... self.longtidues = ... ``` Instead, the Collection can hold a list, dictionary etc of `Glacier` objects, and access that information indirectly through each `Glacier`. ```python class GlacierCollection: def __init__(self, ...): self.glaciers = ... ... def nearest(self, ...): latitudes = [glacier.name for glacier in self.glaciers] ``` Ideally, a good class structure will lead to **less** repetition of code and data, not more! - Even in the same class, avoid duplication of data. Don't keep both the "raw" input value (e.g. a string) and the processed result unless you think you will need the raw value again. Having two copies makes it easier to make mistakes, and adds complexity (which value do I use? are they both in sync? what type was this again?) - When choosing data structures, consider how you will be using them. If you need frequent lookups (e.g. find a glacier by its id), a dictionary is likely better than a list, for performance, conciseness and ease of handling errors: ```python # Dictionaries offer quick and easy retrieval by key my_dict[glacier_id] # raises KeyError if key not found my_dict.get(glacier_id, default_value) # returns default value if key not found # Lists are more useful for iterating over every element, # but also offer access by element index = my_list.index(glacier_id) # raises ValueError if value not found # note that index will always return the first occurrence of glacier_id only! my_list[index] ``` ## Analysis - Make sure you check the expected format or units of the arguments to functions from libraries you are using. For example, the trigonometric functions in the `math` module in the standard library expect their arguments to be in radians instead of degrees. - Where there are already existing functions for performing standard tasks (and they are in libraries that you are able to use in a particular project), it is often better to use these rather than implement your own. For example, the `math` module in the standard library provides a `radians` function for converting angular quantities to radians from degrees - using this rather than writing your own solution makes it less likely you will make a mistake. Similarly there are built-in functions and object methods in Python for sorting collections and for finding the smallest or largest elements in a dataset. - It is typically more readable to use a boolean value directly as the condition in an `if` statement rather than comparing to `True` or `False`: ```python if isinstance(x, int): if not isinstance(x, int): # instead of: if isinstance(x, int) == True: if isinstance(x, int) == False: ``` - Also it is typically better not to reverse the meaning of boolean arguments in your code. For example there were similar snippets to this in several of the `GlacierCollection.sort_by_latest_mass_balance` implementations ```python if reverse == True: reverse = False else: reverse = True ``` which makes it difficult to reason about what `reverse` means at different points in the code. - When using dictionaries, be careful to use keys that you know will be unique, unless you are sure you want to overwrite the values when multiple items have the same key. Several people used distance from the specified point as a dictionary key in `GlacierCollection.find_nearest` which resulted in unexpected behaviour when multiple glaciers where at the same distances from the specified point. - Make sure to carefully check what the expected interface for functions is, including whether arguments are optional and what their default value is if so, and what the return type of a function is, if any. It is common in software engineering to need to write functions that conform to an interface to allow them to interact with external code. If you do not meet the requirements you code is likely to raise errors or give unexpected outputs. - In general unless the specific purpose of a function is to print output to the `stdout` stream, as was for example the case for the `GlacierCollection.summary` method here, it is generally better not to have `print` statements as side-effects in other functions. If you do want to record a message, in most cases it will be better to use the functionality in the `logging` module in the standard library (which allows logging information at different priority levels) or if appropriate, generate a warning (see [Validation](#What-to-raise) section below). In both cases, the user can programmatically control whether such output is sent to `stdout`. - For functions or methods which compute and return collections such as lists, if the computed collection is empty, it is generally better to return the empty collection rather than return an alternative value such as `None` or raise an exception (unless you are very sure that the collection being empty is *exceptional* behaviour and indicative of a problem). This ensures that downstream code which, for example, expects to be able to call `len` on the output of the function, will not fail. ## Validation ### What to raise - Using `print` is not the same as raising an error! Compare these two lines: ```python print("Invalid value for measurement.") # this does not stop the program # vs raise ValueError("Invalid value for measurement.") # this alerts the user ``` The latter will notify the user that something has gone wrong; they can then either stop the program flow, or handle the error (with `try... except`). Instead, printing will fail "silently", so the program will continue with some improper values and the user may not notice the problem. - If you want to show a message without stopping the execution, you can use a warning: ```python import warnings warnings.warn("Missing value detected. Result may be incorrect.") ``` These can be detected and handled, both in tests and in "real" execution. - Instead of using `assert`, it is better to raise a specific exception that corresponds to the problem. One suggestion is to use `assert` only to check that your code behaves internally as expected, not for validating external inputs. - Make sure that error messages are informative and specific so that they help identify what went wrong, where and how to correct it. "Something went wrong" isn't helpful. ### What to check and where - Think about edge cases for what kinds of inputs your functions will receive, and check for them. - Rather than calling validation code only from `GlacierCollection` prior to using the values to create `Glacier` objects, it would have been better to do the validation in the `Glacier` class itself. That would allow use of the `Glacier` class by itself whilst also including its input validation. - Be careful to check the type of an input before treating it as the expected type, e.g. trying to call string methods on integer inputs or using `len()` on a non iterable datatype: ```python number_of_digits = len(string) # this will crash if the value is not a string! if not isinstance(value, string): raise ValueError(f"Invalid value encountered, must be a string: {value}") # number_of_digits should be computed here instead, # after we know that the type is correct ``` ### Catching errors - You can respond to errors being raised with `try ... except`: ```python def validate_latitude(latitude): if not -90 < latitude < 90: raise ValueError(f"Invalid latitude: {latitude}") lat, lon = ... try: validated_latitude(lat) validate_longitude(lon) my_glacier.set_location(lat, lon) except ValueError as err: # Here, we can recover from the error, e.g. by using some default values # (although that can be unintuitive, so be careful - # probably not a good idea here!) my_glacier.set_location(0, 0) # or we can give up entirely more gracefully, # deleting any resources we are holding # We could use a new error message... error_message = f"Invalid values encountered for a glacier: {err.args[0]}" raise MyCustomError(error_message) # ...but we can also re-raise the original error: raise ``` ## Tests ### Writing and updating tests - Tests are code too, and they should be readable! Make sure to use meaningful variable names, and docstrings to outline the intent of the test: ```python def test_output_empty(): """Check that the function returns an empty list for empty inputs.""" ``` - Commit all the data files that your tests need. Make sure that your tests pass at every commit (continuous integration can help with this). - Make sure expected output is understandable when you read the test. It’s a lot better to have hardcoded expected values (that are easily calculated) rather than complicated logic to test the case. - Make sure that each test has an `assert` statement or a `pytest.raises` check! - If you are checking that an error is raised, you don't need a final assertion (unless you're checking properties of the error): ```python def test_find_nearest_error(define_collection, err, lat, lon, n): collection = define_collection with pytest.raises(err): collection.find_nearest(lat, lon, n) assert True # not needed! ``` ### What to test and how - Tests should ideally check one particular aspect of a function with one case. If you find multiple asserts or test cases in a test, you should either parameterise the test or break it up into multiple tests. For example: ```python def test_output(): x = my_function(5) assert x == 10 x = my_function(-5) assert x < 0 ``` It would be better to replace the above test with two. This makes it easier to see exactly what has gone wrong when a test fails. - If you're testing the output from a function, ensure that all aspects of the returned information is tested (e.g. if returning multiple items, check all elements or at least check specific elements and the number of items). ```python def my_function(x): return [x, 2 * x, 3 * x] def test_my_function(): y = my_function(5) # avoid this, as it checks very little of the output assert y[0] == 5 # instead, check all the elements assert y == [5, 10, 15] # or, if the list is too long, you could check some basic characteristics # (these could also be separate tests) assert len(y) == 3 assert y[0] == 5 assert y[-1] == 3 * y[0] ``` - For negative tests, use `pytest.raises`. Other home-baked ways often will have silent failures. - Tests should check the public interface of your code, so that you know that it functions as expected rather than only testing utility functions that are used by the public interface (although you can also test those).