---
title: Presentation
date: 9 Mar 2023
---
# :spiral_note_pad: Metadata Library Intorudction
:::success
:dart: **Goals**
1. **Extract** all the information from the `metadata*.xml`
2. **Tranform** it, i.e. clean it
3. **Load** to the database.
:::
# :building_construction: How
```mermaid
graph LR
N --Api--> Havester((Havester))
E --Api--> Havester((Havester))
R --Api--> Havester((Havester))
C --Api--> Havester((Havester))
Havester --metadata*.xml--> Parser{XmlParser/XmlSerializer}--ETL-->Database[(Database)]
Database -.-> EF{EFCore} -.-> SearchBar
EF -.-> Database
SearchBar -.-> EF
```
# :gear: Serialization/Deserialization
- `XmlSerializer`
- `C#`
- [`XmlSerializer`](https://learn.microsoft.com/en-us/dotnet/api/system.xml.serialization.xmlserializer?view=net-7.0)
- `XmlParser`
- `Python`
- [`bs4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [`lxml`](https://lxml.de/tutorial.html)
- [...and more](https://realpython.com/python-xml-parser/)
- `C#`
- [`System.Xml.XElement`](https://learn.microsoft.com/en-us/dotnet/api/system.xml.linq.xelement?view=net-7.0)
# :gear: EF core
## :information_source: What is EF core
- [Entity Framework (`EF`)](https://learn.microsoft.com/en-us/ef/core/get-started/overview/first-app?tabs=netcore-cli)
- A handy library for programmer to communicate with databases
- ~ to [`SQLAlchemcy`](https://www.sqlalchemy.org/) in `Python`
## :information_source: How to be use?
```csharp=
using Microsoft.EntityFrameworkCore;
using System;
using System.Collections.Generic;
public class BloggingContext : DbContext
{
public DbSet<Blog> Blogs { get; set; }
public DbSet<Post> Posts { get; set; }
public string DbPath { get; }
public BloggingContext()
{
var folder = Environment.SpecialFolder.LocalApplicationData;
var path = Environment.GetFolderPath(folder);
DbPath = System.IO.Path.Join(path, "blogging.db");
}
// The following configures EF to create a Sqlite database file in the
// special "local" folder for your platform.
protected override void OnConfiguring(DbContextOptionsBuilder options)
=> options.UseSqlite($"Data Source={DbPath}");
}
public class Blog
{
public int BlogId { get; set; }
public string Url { get; set; }
public List<Post> Posts { get; } = new();
}
public class Post
{
public int PostId { get; set; }
public string Title { get; set; }
public string Content { get; set; }
public int BlogId { get; set; }
public Blog Blog { get; set; }
}
```
## :o: Pros
- No need to write SQL
- No need to explicitly write relation, it is declared by the class (entity)
- 1-to-1 relationship, e.g. 1 post belongs to 1 blog,
- 1-to-$n$ relationship, e.g. 1 blog could have $n$ posts,
- Programmer can use the query data as an object
- i.e `blog.Posts` will get you a list of post in the database
- You can easily change to different relational database without rewriting the code.
- currently we are using `Sqlite` and trying to migrate to `PostgreSql` which is provided by `aws-rds`
## :x: Cons
- Given you have some `C#` experience, **it takes time to learn.**
- It **could be slow when the class is complex**, i.e. if the classes have a lot of inheritance/polymorphism
- While it claims you do not need to write SQL, actually during the initial setup, you will probably need to write SQL anyway. Especially when you have complex/weird relationship between tables.
## :thinking_face: Code 1st or Database 1st?
::::info
:bulb: **Code 1st**: Write the classes first. Design the classes/relationships to init the database then load the data in the database by
:::warning
:heavy_exclamation_mark: Exactly what :older_adult: Vasili is doing in the metadata library.
:::
::::
::::info
:bulb: **Database 1st**: Get the data first, load it to the database as it is. write the classes/relationships according to the tables.
:::warning
:snake: **`Pythonically`**:
1. Load each `xml` object i.e.`<sth>sth_value</sth>` into different dataframe (in RAM)
2. Keeping track of metadata `fileIdentifier` of each dataframe.
3. Load the dataframes to databse
:::
:::spoiler
**Example**
suppose our `example_metadata.xml` looks like this
```xml!
<MD_Metadata>
<fileIdentifer>aaa-bbb-ccc-111-222-333</fileIdentifer>
<title>dummy_title</title>
<CI_Contact>
<name>dummy_contact_name</name>
</CI_Contact>
</MD_Metadata>
```
In python you will have two dataframes
`MD_Metadata_df`
| uid | fileIdentifer | title | contact |
|:--------------:|:-----------------------:|:-----------:|:-------------:|
| metadata_uid_1 | aaa-bbb-ccc-111-222-333 | dummy_title | contact_uid_1 |
`CI_Contact_df`
| uid | fileIdentifer | name |
|:-------------:|:-----------------------:|:------------------:|
| contact_uid_1 | aaa-bbb-ccc-111-222-333 | dummy_contact_name |
Note that `uid` can be assigned by either database/programmer. With this approach, we can have the database populated first and, at the minimum, you will know which one belong to which one.
:::
::::