August 4 - HackMD

# Progress

## Thesis

Malevich needs a mechanism for testing components efficiently. Right now, running a pipeline with the Dev. tool is inefficient because of the following problems:

1. Sometimes logs are missing entirely, which makes it impossible to determine the source of a problem.
2. The most you can do for debugging within Malevich is printing, which is a very inefficient debugging strategy.
3. The Dev. tool overhead is large, so small bugs such as a type mismatch or a missed edge case take a long time to fix.

## Solution

That is why I focused on testing an existing component in a local environment. However, today I ran into a problem related to schema importing.

```python
from jls import (
    texts,
    text_with_embeddings,
    text_pairs_with_scores,
    text_indices,
    scheme1,
    recipient_scheme,
)
```

While I am quite sure it is not a problem to import a non-existent schema when running with the Dev. tool, it becomes a problem when trying to run this code locally with PyTest. As a fix for this might not appear soon, I propose the following:

> An application, mainly processors, should serve as simplistic wrappers around actual implementations.

This means that for any future application we write, it is reasonable to have the following structure:

```
| some_app
|__ apps
    |__ impl                  <- the actual implementation
        |__ some_processor.py # def some_processor(...): ...
        ...
    |__ processor.py          # def some_processor (...): impl.some_processor(...)
    ...
...
```

So we will have a submodule named `impl` that contains the Python code implementing the logic we want to deploy on Malevich. Functions in this submodule will not be decorated with `@jls` decorators, which makes it possible to test them locally. The decorated functions will simply delegate to the actual implementation. For example:

```python
# processor.py
# Assumed import: `jls` and `Context` are taken to be exported by the `jls` package.
from jls import Context, jls

import impl


@jls.init(prepare=True)
def initialize_langchain(ctx: Context):
    """Initialize the langchain app.

    Initializes two objects:
        - Embedder (ctx.app_cfg["embedder"]) - used for embeddings
        - Chat (ctx.app_cfg["chat"]) - used for chat
    """
    # Thin wrapper: all logic lives in the undecorated implementation.
    impl.initialize_langchain(ctx)
```

```python
# impl/initialize_langchain.py
def initialize_langchain(ctx: Context):
    """Initialize the langchain app.

    Initializes two objects:
        - Embedder (ctx.app_cfg["embedder"]) - used for embeddings
        - Chat (ctx.app_cfg["chat"]) - used for chat
    """
    ctx.app_cfg["embedder"] = get_embedder_with_backend(
        backend=ctx.app_cfg.get("embeddings_backend", "openai"),
        embeddings_type=ctx.app_cfg.get("embeddings_type", "symmetric"),
        model_name=ctx.app_cfg.get("model_name", None),
        api_key=ctx.app_cfg.get("api_key", None),
    )

    ctx.app_cfg["chat"] = get_chat_with_backend(
        backend=ctx.app_cfg.get("chat_backend", "openai"),
        api_key=ctx.app_cfg.get("api_key", None),
        temperature=ctx.app_cfg.get("temperature", 0.5),
    )
```

## Summary

To sum up my idea, keeping the implementation separate from the Malevich interface declaration:

**[+]** Enables testing with PyTest

**[+]** `processor.py` is now more declarative, making it easier to understand the capabilities of a given library (or module within the app)

**[+]** Code is debuggable

**[--]** We have to either keep both sets of documentation up to date or leave the implementation undocumented.

# Task

Your task is still to finish the pipeline. Additionally, I am asking you to restructure the hf repository in the same way I did with langchain and write tests for it.
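
As a starting point for those tests, here is a minimal sketch of how an undecorated `impl` function could be exercised with PyTest. It is an illustration under assumptions, not the actual test suite: `make_ctx` is a hypothetical helper that fakes only the `app_cfg` attribute of the context, and the two `get_*_with_backend` functions are hypothetical stand-ins that return plain dicts so the file runs without any API keys. In a real test module, `initialize_langchain` would be imported from `impl` rather than redefined.

```python
from types import SimpleNamespace


# Hypothetical stand-ins for the real backend factories (no network, no API keys).
def get_embedder_with_backend(backend, embeddings_type, model_name, api_key):
    return {"backend": backend, "type": embeddings_type, "model": model_name}


def get_chat_with_backend(backend, api_key, temperature):
    return {"backend": backend, "temperature": temperature}


# Copy of the undecorated implementation; a real test would import it from `impl`.
def initialize_langchain(ctx):
    ctx.app_cfg["embedder"] = get_embedder_with_backend(
        backend=ctx.app_cfg.get("embeddings_backend", "openai"),
        embeddings_type=ctx.app_cfg.get("embeddings_type", "symmetric"),
        model_name=ctx.app_cfg.get("model_name", None),
        api_key=ctx.app_cfg.get("api_key", None),
    )
    ctx.app_cfg["chat"] = get_chat_with_backend(
        backend=ctx.app_cfg.get("chat_backend", "openai"),
        api_key=ctx.app_cfg.get("api_key", None),
        temperature=ctx.app_cfg.get("temperature", 0.5),
    )


def make_ctx(**app_cfg):
    """Hypothetical fake context exposing only the `app_cfg` dict."""
    return SimpleNamespace(app_cfg=dict(app_cfg))


def test_defaults_are_applied():
    ctx = make_ctx()
    initialize_langchain(ctx)
    assert ctx.app_cfg["embedder"]["backend"] == "openai"
    assert ctx.app_cfg["embedder"]["type"] == "symmetric"
    assert ctx.app_cfg["chat"]["temperature"] == 0.5


def test_config_overrides_are_respected():
    # "fake-backend" is an arbitrary illustrative value, not a real backend name.
    ctx = make_ctx(embeddings_backend="fake-backend", temperature=0.1)
    initialize_langchain(ctx)
    assert ctx.app_cfg["embedder"]["backend"] == "fake-backend"
    assert ctx.app_cfg["chat"]["temperature"] == 0.1
```

Saved as a `test_*.py` file, this can be run with `pytest` directly. Because the implementation never touches a `@jls` decorator, the tests can be stepped through with an ordinary debugger instead of relying on printing.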