
###### tags: `fh`

# BAKK Projekt

[TOC]

## Meetings

### First meeting with Andreas and Moritz

* Gap between the computer vision engineers and the rollout.
* Backend development and rollout.
* Gap between the experimental and the practical.
* Best practices: "Machine Learning Operations".
* What is the current state of machine learning models and operations? What are the concepts and methods?
* "They give us a model, we put a wrapper around it, and that's it."
* Both sides have to be considered. Model developers want to know how it runs and looks at the customer's site in the end. But at the end of the day, things look different in practice.
* Possibly improve the lifecycle.

Links:

* https://en.wikipedia.org/wiki/Machine_learning
* https://en.wikipedia.org/wiki/MLOps
* https://de.wikipedia.org/wiki/OODA-Loop#/media/Datei:OODA_Deutsch.png

Practical part, developer to production: simulate a development and show the lifecycle, including the product. What does it look like during a rollout?

TensorFlow Serving (Phillip Gebhard, Florian):

* How is the hosting done there?
* How would Partium have to be adapted?
* What infrastructure is needed?
* Is my prior knowledge sufficient? Assessment.
* Package it as small as at all possible.

### Conversation with Anton

ISO standards: dealing with organizational topics

* Responsibilities
* Processes
* Accountabilities

Who audits? IBM or someone else? How do I organize this? How do I structure it? Does it apply to all companies (Spain etc.)? How will the employees accept it?

**Organization and structure:** change management, getting people used to it, interviews to check feasibility and acceptance.

The complete ISO is impossible for ME by the end of 2021. A good start would be contact with an external auditor or external consultant. The professor is very cautious about this. It is more of an organizational analysis than a bachelor's thesis. It sounds more like a work assignment from the company to me than a usable bachelor's thesis for my company. It puts me in a dependent position.

"What do the employees of a revolutionary company say about a standardization such as ISO 21001?"

Where am I in 20 years? Management? Consulting? etc.

**Decision AGAINST ISO**

### Moritz Nossek

I talked briefly with Moritz about who in the company would be a good contact person for the different "stations" in the pipeline.

* Datasets: possibly Bernd, but Moritz was not sure either
* Model development: most likely Thomas Kazmar (Head of R&D), then possibly Roman Gurbat
* "Wrapping": Phillip Gebhardt has a good overview of model-to-program development

#### Papers on the topic

* "Machine Learning: The High-Interest Credit Card of Technical Debt" - D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, ....
* "Rules of Machine Learning: Best Practices for ML Engineering" - Martin Zinkevich

#### Reading list

* https://en.wikipedia.org/wiki/Machine_learning
* https://en.wikipedia.org/wiki/MLOps DONE
* https://de.wikipedia.org/wiki/OODA-Loop#/media/Datei:OODA_Deutsch.png DONE
* https://en.wikipedia.org/wiki/DevOps
* https://en.wikipedia.org/wiki/Software_development_process
* https://en.wikipedia.org/wiki/Continuous_integration
* https://en.wikipedia.org/wiki/Continuous_delivery
* https://en.wikipedia.org/wiki/Version_control
* https://en.wikipedia.org/wiki/Pipeline_%28software%29 DONE
* https://www.bmc.com/blogs/deployment-pipeline/

### Papers:

* Hey ML, what can you do for me
* **Machine learning Pipelines, From research to production**
* MLModelScope Evaluate and Introspect Cognitive Pipelines
* MLOPS A SICBased Minimum Frame Length
* On the Co-evolution of ML Pipelines and Source Code
* PipelineProfiler: A Visual Analytics Tool for the Exploration of AutoML Pipelines
* Sustainable MLOps Trends and Challenges
* Test Automation with Grad CAM Heatmaps Pipe Segment
* Towards MLOps A Case Study of ML Pipeline
* When DevOps meets Meta-Learning a portfolio to rule them all

The topic can be extended in many directions; research what other teams are able to implement. Something tangible would be good, e.g. a technical implementation of a pipeline, on the process level and on the technical level, that is automated as far as possible. In the end, an ML developer who trains the models should go through a process that lets them make the models available via the pipeline.

Key points and hints from Moritz:

* how do I satisfy my stakeholders (analysis)
* tooling research: which tools?
* what do I have to build on?
* which teams are involved?
  * talk to them
* the implementation is quick (1 week)
* debugging could take longer

#### Watchlist

https://www.youtube.com/watch?v=_xH7mlDGb0c

### Conversations with Phillip Gebhardt

It used to be chaos. Models are trained by the RnD department. The effort was always in getting the data clean. Mismatches often happen because of sloppy work.

What to improve: data cleaning is the most sensible place. An interface to search for parts quickly: a GUI with filter options on the left and right to compare parts/datasets, to address the problem of missing and inconsistent data.

Master data from the customer varies, and the image data then has to be matched to the customer's master data. -> Do I have images for all master data, and is every image assigned to a master data record?

Deployment is easy. Data cleaning is the more important part. Large amounts of data, imprecise data. Some of it does not match exactly. Matching 2 datasets will be the challenge. Downscaling of images so that they are displayed faster.

Recommendation: talk to Thomas Kazmar.

**Georg Fischer**: 1 week of nothing but data cleanup by 2 people, just for a proof of concept. This will be far more difficult and time-consuming for Deutsche Bahn etc.

#### Data Cleanup

Data: not only image data but also master data. Part ID, **hierarchy data**, (names, description, external ID, SAP data, category) etc. This always depends on the customer. Data refinement also covers customer data, not only ours, e.g. removing abbreviations. Merging columns would be interesting.
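
The two-way check from the conversation above (does every master data record have images, and does every image belong to a master data record?) could be sketched roughly as follows. This is only a sketch under assumptions: the field name `part_id` and the file naming scheme `<part id>_<view>.<ext>` are hypothetical and will differ per customer.

```python
import json
from pathlib import Path


def match_master_data_to_images(master_records, image_names, id_key="part_id"):
    """Report master records without images and images without a master record.

    Assumption (hypothetical): an image file name starts with the part ID,
    followed by an underscore, e.g. "4711_front.jpg".
    """
    master_ids = {str(record[id_key]).strip() for record in master_records}
    image_ids = {Path(name).stem.split("_")[0] for name in image_names}
    return {
        "records_without_image": sorted(master_ids - image_ids),
        "images_without_record": sorted(image_ids - master_ids),
    }


# Made-up sample data, processed as plain JSON without a database.
master = json.loads('[{"part_id": "4711"}, {"part_id": "4712"}]')
images = ["4711_front.jpg", "4711_back.jpg", "9999_top.png"]

report = match_master_data_to_images(master, images)
print(json.dumps(report, indent=2))
```

Working on sets keeps the comparison independent of record order and of how many images a part has; inexact matches (renamed or abbreviated IDs) would need an additional normalization step before the comparison.
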
For datasets, turn to **Bernd**. Georg Fischer might be an interesting one.

* Frequent use case: customers send a dataset and later an update, which often has a different format. Problem: columns are missing, renamed or merged.
* Use case: statistics about the data, finding out where data is missing, e.g. whether all master data records have image data assigned.
* Use case: also map in the warehouse ID.
* Wanted: process the data as JSON. No DB. Just manipulate it.

Still to contact regarding the project: Thomas Kazmar and David Geronimo.

### Tomas Kazmar & Albert Trias Mansilla

Thomas Kazmar, overview: We as RnD are in the process of handing the process of servicing a customer, which includes data cleanup, over to provision-services. They have something like a pipeline: a repo with a set of Python scripts that move the data from the customer to the app. JSON but NoSQL.

Albert Trias Mansilla, a member of RnD, is a data engineer and currently does the data sanitization. Alternatively, talk to Michael Probst.

One direction that might be useful is cleaning up the images. Different sources give different quality of data.

Albert: Customers provide data in different formats. In the case of the images there are many errors because they get loaded onto an FTP server. [But isn't this too variable? Maybe solve this problem for one specific customer?]

Other project ideas:

Pipeline part: Apache Airflow, NiFi. But what specifically? AWS triggers, Lambda functions ???

#### IMAGE IMPORTER

Bounding boxes are sort of automated but still buggy if run in parallel. -> Some sort of coordinator

Interacting with

#### DB DATA SANITIZING

We know that we have variable datasets. At the same time, bigger datasets are full of issues. The biggest set was from Deutsche Bahn. Several steps need to be done:

* small specifications
* all revolving around images
  * corrupted
  * duplicated across different classes/products
  * appearance mixtures
* if one ID has multiple parts, it causes a minor problem.
  this was ignored until now

#### Business intelligence

Advantage for the company: it doesn't block the other guys.

There is a third-party tool called [**Label Studio**](https://labelstud.io/). It is a nice tool and will be used, but it needs to be integrated into our system. Wanted: an intelligence tool that compares different annotators to figure out the quality of each.

* Google Vision
* Amazon QuickSight(?)
* Using Google Data Studio as the interface to display the data/resources, or some of these tools:
  * https://datastudio.google.com/navigation/reporting
  * https://cloud.google.com/bigquery/
  * https://aws.amazon.com/de/quicksight/
  * https://aws.amazon.com/de/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc

The result should be that this tool informs us how the data needs to be prepared before sending it to different annotators.

# Details on the 1st thesis | Improving the pipeline (discarded)

## Formulating the question(s) and the project:

To determine what needs to be done at all, and where I can start with the work around a possible toolchain, I first have to establish the current state of affairs. This can be done using different methods. In my view the most promising would be:

* Interviews with the employees
* Questionnaires
* Manual research on the intranet

Which questions need to be clarified?

* What does the lifecycle look like at the moment?
* Which tools are used?
* In what form are AI models delivered? (ML development side)
* In what form are AI models needed? (end product development side)

In that sense, the 1st paper would be a kind of data collection, with the rough question:

*What is the current state at the company Humai regarding the development/lifecycle of ML software products, and what would be needed to ensure a complete pipeline from the creation of datasets to the rollout of a product, as well as the continuous further development and support of that product?*

And the question or topic of the 2nd paper:

*How can one partial aspect of this pipeline, worked out theoretically in the 1st paper, be put into practice within the company structure of Humai?*

The project would then of course be the implementation of that aspect. The toolchain/pipeline would be practical, since there might really be something to program there.

## ?

For DevOps to be used sensibly, the software has to meet some requirements such as:

* deployability
* modifiability
* testability
* monitorability

Similar criteria will be important for MLOps. (https://en.wikipedia.org/wiki/Architecturally_significant_requirements)

## Goals of the end product

* improved deployment frequency
* shorter time until a product is on the market
* lower failure rate in new products
* shorter time until things are fixed
* faster recovery time after a crash

### Good Quotes

> IT performance can be measured in terms of throughput and stability. Throughput can be measured by deployment frequency and lead time for changes; stability can be measured by mean time to recover. The State of DevOps Reports found that investing in practices that increase these throughput and stability measures increase IT performance

(Nicole Forsgren; Gene Kim; Nigel Kersten; Jez Humble (2014). "2014 State of DevOps Report" (PDF). Puppet Labs, IT Revolution Press and ThoughtWorks. Retrieved 27 April 2019. || "2015 State of DevOps Report" (PDF). Puppet Labs, Pwc, IT Revolution Press. 2015. Retrieved 6 May 2019)

# Details on the 2nd thesis | Business Intelligence Tool (BIT)

Following the brainstorming with Tomaz and Albert, I delved into the development of a business intelligence tool as my second project, which I also want my 2nd bachelor's thesis to relate to.

## What is the purpose of this program?

According to Tomas:

> In short, this tool should provide various summaries of the running annotations. Which projects are currently being annotated, when approximately they might be finished, so that we know when we need to create more, what the throughput is, so that we can easily check if we are using up the available bandwidth, and how the individual annotators are performing. Then on the level of different annotation types (I expect we will have 5-10), we will want to know the overall status, e.g. how many tasks were skipped. If this would not be complex enough, we can also think about how to optimize task allocation, e.g. inserting quality checks. One thing to keep in mind is also that there are multiple annotation servers running different annotation projects.

## How is this program going to be made?

Making a GUI from scratch would take too much time according to my colleagues, so they suggested I could use Google Data Studio as the interface to display the data gathered from different sources.

## What are possible research questions linked to this work?

* What are annotations and why are they required to train an AI model?
* How can the company Partium benefit from a business intelligence tool such as BIT?

## Discussion 28.09.2021

Multiple instances: one inside the company network and 2 of them in AWS.

Methods ("types of annotation"): bounding boxes and tags. How long does each take? How fast is each method? For text, you get a piece of text and mark/label a certain passage.

**Diving into the result and metadata is not required for this intelligence tool.**

So there is a data collection? Yes, BUT we don't have the data yet. We need to extract it, transform it and parse it into Google Data Studio (the front end).

The logic for what?

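
The extract, transform and export step from this discussion could look roughly like the sketch below. It is not based on the real export format of the annotation servers: the record fields (`server`, `project`, `type`, `skipped`) and the sample values are hypothetical and only illustrate grouping the tasks and writing a CSV that a dashboard front end could read.

```python
import csv
import io
from collections import Counter

# Hypothetical, simplified task records; the real export format is not known here.
tasks = [
    {"server": "internal", "project": "A", "type": "bounding_box", "skipped": False},
    {"server": "internal", "project": "A", "type": "bounding_box", "skipped": True},
    {"server": "aws-1", "project": "B", "type": "tag", "skipped": False},
]

FIELDS = ["server", "project", "type", "done", "skipped"]


def summarize(task_records):
    """Count done and skipped tasks per (server, project, annotation type)."""
    counts = Counter()
    for task in task_records:
        status = "skipped" if task["skipped"] else "done"
        counts[(task["server"], task["project"], task["type"], status)] += 1
    groups = sorted({key[:3] for key in counts})
    return [
        {
            "server": server,
            "project": project,
            "type": kind,
            "done": counts[(server, project, kind, "done")],
            "skipped": counts[(server, project, kind, "skipped")],
        }
        for server, project, kind in groups
    ]


def to_csv(rows):
    """Serialize the summary rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = summarize(tasks)
print(to_csv(rows))
```

Keeping the grouping separate from the CSV serialization means the same summary rows could later be written to a different target without touching the counting logic.
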
The hardest thing is collecting, parsing and transforming the data. Also grouping the data to make it easier to understand, and outputting it in the correct format (CSV).

## Call with Radinger-Peer

After I showed him the requirements doc, the professor said the project is fine. Regarding the thesis/paper, however, he said it is not enough and suggested the following:

**"Performance analysis & improvement for an AI-based labeling tool"**

That means the following tasks until the next call (whenever I schedule it):

1. Literature research
2. What is labeling?
3. Benchmarking
4. How does one evaluate such systems from the hardware and software side?
5. Table of contents of the thesis

## Call with Anton & Tomaz

### Questions:

#### Sent on the 15.10. ~17.00

* What do I need?
  * How many servers are there?
  * How do I access them?
    * IP
    * Username
    * PW
  * Which files on each server do I need to parse, and where are they?
  * How do I access those files?
    * HTTPS
    * SSH
    * other
  * Without any base data I can't get familiar with Google Data Studio
* General questions:
  * Are all annotations done automatically, or which of the 4 processes are used?
    * Pre-labeling
    * Auto-labeling
    * Online learning
    * Active learning
    * Manual labeling
  * How do the annotations in use work?
  * Is the comparison/cost analysis just for the FH or actually required for the company?
  * Where do the requirements for the boards come from? Make them up myself? Get guided by you guys?
Answer:

> [15.10.2021 17:14] Albert Trias Mansilla
> let me answer some of these questions: main server is: http://labelstudio.humai.tech/projects, we have 2 more, for now just consider that you can have N.. you can register in each one, and obtain a token to access through the api
> you should be able to get the data of the projects through: https://gitlab.imagination.at/r_and_d/annotation-scripts/-/blob/feature/engel-query-images/label_studio.py#L206 and you can get the projects available in an instance with: https://gitlab.imagination.at/r_and_d/annotation-scripts/-/blob/feature/engel-query-images/label_studio.py#L408
> imho we do not need a "separate intermediate server", we can just use the google equivalent of "AWS lambda functions"
>
> [15.10.2021 17:15] Albert Trias Mansilla
> annotations are done manually by humans
>
> [15.10.2021 17:16] Albert Trias Mansilla
> imho it would be good that we explain a few things, also I guess that our role is not just as "customers", and maybe we can help with planning, suggestions, etc....
>
> [15.10.2021 17:27] Albert Trias Mansilla
> - to add the data to google data studio I would use big query reading a google bucket
>
> [15.10.2021 17:29] Albert Trias Mansilla
> if I remember well, first we should choose a cloud provider, although my preference is Google (I know the solutions of gc and aws)

#### Sent on the

* Which databases are used? SQLite or PostgreSQL

### Notes:

* Possible languages encountered:
  * Python
  * Django
  * React
  * JavaScript
  * MST
  * SQLite
  * PostgreSQL

# Project start 17.01.2022

After 2 months the NDA is finally signed.
I got the following tasks from Albert that I can do without the signed NDA:

* Get familiar with Google Data Studio, create some data and play with it
  * did this as far as I could back in November with a spreadsheet I had
* Prepare a mock with Google Data Studio that would be used to gather requirements
  * would need to know what data fields to expect
    * time it took to finish X jobs
    * expected time it will take to complete a batch
    * total jobs/batches running
* Talk with different stakeholders to obtain the requirements
  * Who are the stakeholders?
    * "me"
    * Albert
    * Moorea
    * people from R&D
  * 1. mockup from myself, 2. talk with Moorea and improve the mockup, 3. talk with Albert and finish up the mockup
* Learn how to use BigQuery
  * uses SQL
  * do we really need this / are we already paying for this?
  * what are the advantages as opposed to using our own infrastructure?
    * advantages:
      * just pass a file and it allows SQL queries over the files
      * won't have to maintain your own DB
      * better scalability
  * other sources of data:
    * ideally via a source file to loop through with the script
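
For the mock data field "expected time it will take to complete a batch", one simple estimate is the mean duration of the jobs finished so far, scaled by the number of remaining jobs. The sketch below assumes, purely for illustration, that jobs are spread evenly across annotators working in parallel; the function name, parameters and sample numbers are made up.

```python
from datetime import datetime, timedelta
from statistics import mean


def estimate_batch_completion(durations_min, jobs_remaining, annotators, now):
    """Estimate when a batch finishes from the mean duration of finished jobs.

    Assumption: remaining jobs are spread evenly across `annotators`
    people working in parallel at the observed average speed.
    """
    if not durations_min:
        raise ValueError("need at least one finished job to estimate from")
    if annotators < 1:
        raise ValueError("need at least one annotator")
    remaining_minutes = mean(durations_min) * jobs_remaining / annotators
    return now + timedelta(minutes=remaining_minutes)


# Made-up sample: three finished jobs, 120 jobs left, 3 annotators.
eta = estimate_batch_completion(
    durations_min=[4.0, 6.0, 5.0],
    jobs_remaining=120,
    annotators=3,
    now=datetime(2022, 1, 17, 9, 0),
)
print(eta)
```

A mean over all finished jobs is the crudest possible estimator; a per-annotator or per-annotation-type average would probably be closer to what the stakeholders need, but that depends on the requirements still to be gathered.
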