# LinkMaker Code Review

The script is well-constructed and simple, and the use of argparse is particularly neat! There are some areas for improvement around maintainability and performance, as well as some simple fixes and suggestions for modularity. I have broken my suggestions into two sections:

1. General code review and issues (clarity, readability, idiomatic Python, and maintainability)
2. Suggestions to bring key operations down to O(1)/O(n)

## Base Review

### 1. **Idiomatic Python**:

- **Shebang**: The shebang `#! /usr/local/bin/python3` assumes a specific location for the Python interpreter. While this may be correct in many environments, it's more robust to use `#!/usr/bin/env python3`, which finds Python via the system's `PATH`.
- **Imports**: `urllib.parse` is imported as `urlparse`. It's generally good practice to retain the original module name for clarity unless there's a strong reason to rename it.
- **Naming conventions**:
  - Python follows the `snake_case` convention for variable names, so `andcondition` would be more appropriately named `and_condition`.
  - Some variables have pylint checks disabled, like `supplementAccounts`. It would be better to rename them to `snake_case` (i.e., `supplement_accounts`) instead of disabling the lint checks.
- **String concatenation**: The Pythonic way to format strings is f-strings (for Python 3.6 and above) or the `format()` method. Instead of `"Number of accounts read from account file: " + str(totalAccounts)`, consider using `f"Number of accounts read from account file: {totalAccounts}"`.

### 2. **Argument Parsing**:

- The use of argparse is appropriate and clean. The mutually exclusive group is a nice and neat way to handle the `--and` and `--or` flags, good job! :-)
- It's good that you've provided default values for some arguments (`idfield`, `quiet`, and `format`). This enhances the user experience.

### 3. **Code Structure, Modularity, and Error Handling**:
The current code places most logic in the script's global scope, which makes it harder to test, reuse, or maintain.

- To improve maintainability and readability, consider breaking the code into functions or classes based on functionality. For instance:
  - `read_supplemental_tags()`
  - `read_account_info()`
  - `generate_output()`
- **Error handling**: Incorporate more robust error handling. For instance, when opening files, handle potential `IOError`s.
- The error handling that checks whether `Account ID` exists in the supplemental tags list is good. This will prevent the script from proceeding with an incorrect supplemental file.
- The tag-matching logic (`if args.andcondition: ... else: ...`) could be clearer. Consider using more explicit variable names or adding more comments.

### 4. **Documentation and Readability**:

While some parts of the code are well documented, there are blocks of logic that could benefit from more detailed comments. For instance, the section where you combine primary and supplemental tags could use some elaboration.

- **Docstrings**: Each function (once they're separated out) should have a docstring explaining its purpose, parameters, return values, and any exceptions it might raise.
- **Consistent naming conventions**: Ensure consistent naming conventions (e.g., `snake_case` for variable names in Python).
- **Remove magic values**: The script hardcodes URLs and other magic values. These should be placed in constants or a configuration file, making them easier to change and reducing the risk of errors. To make the script more maintainable, extract these values and place them at the top of the script as constants.
For example:

```python
BASE_URL = "https://aws-ciw-readonly.amazon.com/cost-management/home?spoofAccountId="
TAGS_COLUMN = "Tags"
ACCOUNT_NAME_COLUMN = "Account Name"
ACCOUNT_ID_COLUMN = "Account ID"
PAYER_ID_COLUMN = "Payer ID"
SUPPLEMENTAL_PREFIX = "s:"
TAGS_DELIMITER = "; "
DEFAULT_JSON_TEMPLATE = {
    "dimension": {
        "id": "LinkedAccount",
        "displayValue": "Linked Account",
    },
    "operator": "INCLUDES",
    "values": [],
}
```

## Efficiency Review

As with any changes made for efficiency, weigh each one on a value/cost matrix with your own time as an input. Some of the data structures used in the script aren't optimal for the operations being performed, leading to inefficiencies.

- **Use dictionaries for faster lookups**: Using a dictionary for `supplementalTags` and grouping accounts by their `Payer ID` would yield significant efficiency gains.
- **Pre-calculate sets**: Avoid converting lists to sets multiple times in loops. Pre-calculate and store these sets before entering the loops.

### Suggested Changes (use your own judgement)

1. **Supplemental Tags Check**:

   ```python
   supp = next(
       (
           acct
           for acct in supplementalTags
           if acct["Account ID"] == temp["Account ID"]
       ),
       None,
   )
   ```

   Here, for each account in the primary CSV, we loop through all accounts in the supplemental CSV to find a match. This results in a time complexity of O(n^2) for this operation.

   **Solution**: Convert the `supplementalTags` list into a dictionary keyed by `Account ID` for O(1) lookups:

   ```python
   supplemental_tags_dict = {tag["Account ID"]: tag for tag in supplementalTags}
   supp = supplemental_tags_dict.get(temp["Account ID"])
   ```

2. **Tag Matching**:

   ```python
   if args.andcondition:
       if set(args.tag).issubset(set(temp["Tags"])):
           ...
   else:
       if not set(args.tag).isdisjoint(set(temp["Tags"])):
           ...
   ```

   While set operations like `issubset` and `isdisjoint` are relatively fast, repeatedly converting lists to sets within loops can become inefficient.
   **Solution**: Convert `args.tag` and `temp["Tags"]` to sets before the loop, and then use these sets within the loop:

   ```python
   args_tags_set = set(args.tag)
   ...
   for row in reader:
       ...
       tags_set = set(tags)
       if args.andcondition:
           if args_tags_set.issubset(tags_set):
               ...
       else:
           if not args_tags_set.isdisjoint(tags_set):
               ...
   ```

3. **Payers List Lookup**:

   ```python
   accts = [row for row in accounts if row["Payer ID"] == payer]
   ```

   While this is a linear search, if the number of unique payers is high and the list of accounts is long, it can be inefficient.

   **Solution**: Use a dictionary to group accounts by their `Payer ID` during the initial processing. This way, you can look up all accounts for a given payer in O(1) time:

   ```python
   accounts_by_payer = defaultdict(list)
   ...
   for row in reader:
       ...
       accounts_by_payer[row["Payer ID"]].append(temp)
   ...
   for payer in set(payers):
       accts = accounts_by_payer[payer]
   ```

4. **Repeated String Concatenation**: It's a very minor point, but repeated string concatenation in a loop can exhibit O(n^2) behavior in some cases (although CPython often optimizes this behind the scenes). This suggestion also appears in the guidance above, since reformatting these strings improves readability as well.

   **Solution**: Use f-strings or the `.join()` method for string constructions that involve multiple concatenations.
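To show how suggestions 1–3 fit together, here is a minimal, self-contained sketch of the processing loop. The dict field names mirror the script's CSV columns (`Account ID`, `Payer ID`, `Tags`), but the sample data and the variables `wanted_tags` and `and_condition` (standing in for `args.tag` and `args.andcondition`) are hypothetical; this is an illustration of the pattern, not the actual script:

```python
from collections import defaultdict

# Hypothetical rows shaped like the script's primary and supplemental CSVs.
supplemental_tags = [
    {"Account ID": "111", "Tags": ["s:prod"]},
    {"Account ID": "222", "Tags": ["s:dev"]},
]
accounts = [
    {"Account ID": "111", "Payer ID": "p1", "Tags": ["prod", "web"]},
    {"Account ID": "333", "Payer ID": "p2", "Tags": ["dev"]},
]
wanted_tags = ["prod"]   # stands in for args.tag
and_condition = True     # stands in for args.andcondition

# 1. Dictionary keyed by Account ID: O(1) supplemental lookups
#    instead of a nested scan over the supplemental list.
supplemental_by_id = {row["Account ID"]: row for row in supplemental_tags}

# 2. Convert the CLI tag list to a set once, outside the loop.
wanted = set(wanted_tags)

# 3. Group matching accounts by Payer ID while reading,
#    so each payer's accounts are retrieved in O(1) later.
accounts_by_payer = defaultdict(list)
for acct in accounts:
    tags = set(acct["Tags"])
    supp = supplemental_by_id.get(acct["Account ID"])
    if supp:
        tags |= set(supp["Tags"])  # merge primary and supplemental tags
    matched = (
        wanted.issubset(tags) if and_condition else not wanted.isdisjoint(tags)
    )
    if matched:
        accounts_by_payer[acct["Payer ID"]].append(acct)

print(dict(accounts_by_payer))
```

With this shape, each account is examined exactly once, so the whole pass is O(n) rather than O(n^2), and the later per-payer output loop becomes a plain dictionary lookup.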