Presentation - HackMD

--- title: Presentation date: 9 Mar 2023 --- # :spiral_note_pad: Metadata Library Intorudction :::success :dart: **Goals** 1. **Extract** all the information from the `metadata*.xml` 2. **Tranform** it, i.e. clean it 3. **Load** to the database. ::: # :building_construction: How ```mermaid graph LR N --Api--> Havester((Havester)) E --Api--> Havester((Havester)) R --Api--> Havester((Havester)) C --Api--> Havester((Havester)) Havester --metadata*.xml--> Parser{XmlParser/XmlSerializer}--ETL-->Database[(Database)] Database -.-> EF{EFCore} -.-> SearchBar EF -.-> Database SearchBar -.-> EF ``` # :gear: Serialization/Deserialization - `XmlSerializer` - `C#` - [`XmlSerializer`](https://learn.microsoft.com/en-us/dotnet/api/system.xml.serialization.xmlserializer?view=net-7.0) - `XmlParser` - `Python` - [`bs4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - [`lxml`](https://lxml.de/tutorial.html) - [...and more](https://realpython.com/python-xml-parser/) - `C#` - [`System.Xml.XElement`](https://learn.microsoft.com/en-us/dotnet/api/system.xml.linq.xelement?view=net-7.0) # :gear: EF core ## :information_source: What is EF core - [Entity Framework (`EF`)](https://learn.microsoft.com/en-us/ef/core/get-started/overview/first-app?tabs=netcore-cli) - A handy library for programmer to communicate with databases - ~ to [`SQLAlchemcy`](https://www.sqlalchemy.org/) in `Python` ## :information_source: How to be use? ```csharp= using Microsoft.EntityFrameworkCore; using System; using System.Collections.Generic; public class BloggingContext : DbContext { public DbSet<Blog> Blogs { get; set; } public DbSet<Post> Posts { get; set; } public string DbPath { get; } public BloggingContext() { var folder = Environment.SpecialFolder.LocalApplicationData; var path = Environment.GetFolderPath(folder); DbPath = System.IO.Path.Join(path, "blogging.db"); } // The following configures EF to create a Sqlite database file in the // special "local" folder for your platform. protected override void OnConfiguring(DbContextOptionsBuilder options) => options.UseSqlite($"Data Source={DbPath}"); } public class Blog { public int BlogId { get; set; } public string Url { get; set; } public List<Post> Posts { get; } = new(); } public class Post { public int PostId { get; set; } public string Title { get; set; } public string Content { get; set; } public int BlogId { get; set; } public Blog Blog { get; set; } } ``` ## :o: Pros - No need to write SQL - No need to explicitly write relation, it is declared by the class (entity) - 1-to-1 relationship, e.g. 1 post belongs to 1 blog, - 1-to-$n$ relationship, e.g. 1 blog could have $n$ posts, - Programmer can use the query data as an object - i.e `blog.Posts` will get you a list of post in the database - You can easily change to different relational database without rewriting the code. - currently we are using `Sqlite` and trying to migrate to `PostgreSql` which is provided by `aws-rds` ## :x: Cons - Given you have some `C#` experience, **it takes time to learn.** - It **could be slow when the class is complex**, i.e. if the classes have a lot of inheritance/polymorphism - While it claims you do not need to write SQL, actually during the initial setup, you will probably need to write SQL anyway. Especially when you have complex/weird relationship between tables. ## :thinking_face: Code 1st or Database 1st? ::::info :bulb: **Code 1st**: Write the classes first. Design the classes/relationships to init the database then load the data in the database by :::warning :heavy_exclamation_mark: Exactly what :older_adult: Vasili is doing in the metadata library. ::: :::: ::::info :bulb: **Database 1st**: Get the data first, load it to the database as it is. write the classes/relationships according to the tables. :::warning :snake: **`Pythonically`**: 1. Load each `xml` object i.e.`<sth>sth_value</sth>` into different dataframe (in RAM) 2. Keeping track of metadata `fileIdentifier` of each dataframe. 3. Load the dataframes to databse ::: :::spoiler **Example** suppose our `example_metadata.xml` looks like this ```xml! <MD_Metadata> <fileIdentifer>aaa-bbb-ccc-111-222-333</fileIdentifer> <title>dummy_title</title> <CI_Contact> <name>dummy_contact_name</name> </CI_Contact> </MD_Metadata> ``` In python you will have two dataframes `MD_Metadata_df` | uid | fileIdentifer | title | contact | |:--------------:|:-----------------------:|:-----------:|:-------------:| | metadata_uid_1 | aaa-bbb-ccc-111-222-333 | dummy_title | contact_uid_1 | `CI_Contact_df` | uid | fileIdentifer | name | |:-------------:|:-----------------------:|:------------------:| | contact_uid_1 | aaa-bbb-ccc-111-222-333 | dummy_contact_name | Note that `uid` can be assigned by either database/programmer. With this approach, we can have the database populated first and, at the minimum, you will know which one belong to which one. ::: ::::