# Building a content aggregator with modern Python

I've been a Feedly user for quite some time now; it helps me manage my information consumption, and features like muting and filtering are particularly appreciated. Motivated by a recent RealPython tutorial, I've decided to build my own content aggregator and write about the journey.

Since this series will document the journey, there will be quite some reorganization of the code along the way, as I want it to keep some of the journal flair and authenticity. This means I don't expect to nail down every single detail and present a polished version at each step. The journey will be about solving problems and adding features step by step. In other words, I'll not overengineer it. I'll try to make smart calls, but the priority is to get things done and keep them as simple as possible.

This is a learning experience in building a somewhat larger piece of software. As such, I'll focus on the steps that are interesting to me and will not explain every single one. Moreover, I'll favor learning potential and experience over scalability and resilience.

My objective is to build an application that exploits the features available in modern Python, good code quality, and modern development tools. This means:

* Heavy use of type annotations, not just for static analysis
* Asynchronous code where possible
* Hopefully avoiding huge JS UI frameworks
* Cloud operations
* NLP to provide summaries and tagging
* Analytics on reading patterns
* Automation

Some of the features I'd like this application to have:

- Automatic polling of feeds
- Simple navigation, marking entries already read
- Filtering/muting of topics
- Duplicate removal
- Auto tagging

And at first, it'll have some constraints:

- Single user
- Since it's an application, it will be developed and tested against a single Python version

These lists are not exhaustive, and neither is in any particular order of priority.

## Getting started

Setting up a Python project can be tricky, and quite involved depending on how many different things are to be added. Projects like cookiecutter or PyScaffold can be of great help here, and although I find them great in general, I've decided against using them. The main reason is to have better control and to learn in the process.

To set things up, I'll use `poetry` to manage the development. I won't defend this choice, and alternatives could be used just as easily. This decision might be reconsidered once a more data-oriented part of the project emerges, where `conda` might be a better fit. Then I'll set up some "basics", namely `black`, `flake8` (with some extensions), `mypy`, `pyupgrade`, `bandit`, `isort`, and `pytest`. Most of these tools will be executed by `pre-commit` together with a few others. Finally, I'll set up some GitHub Actions to provide continuous integration and other automations.

## First model and database layer

The first step will be modeling. Keeping it minimal, what we have is an `Entry`, which is one element of an RSS feed.

Here another decision has to be made: which ORM to use. `SQLAlchemy` would be the obvious choice; it is the powerhouse of ORMs in Python, and the hybrid 1.4/2.0 version even supports an asynchronous operation mode. But I'll try to be adventurous and give another ORM a try. I found two options: [`Piccolo`](https://piccolo-orm.com) and [`Tortoise ORM`](https://tortoise.github.io). In my opinion the biggest disadvantage of `Piccolo` is that it only supports Postgres, with limited SQLite support.
Also, `Tortoise ORM` seems somewhat more mature.

### ORM Interlude

Now, I did follow `Piccolo`'s instructions, and for my taste it had too much code generation and felt bloated. This is the same feeling that `Django` evokes, and why I personally don't like it that much. Then I tried `Tortoise ORM`, which seemed very promising, but I didn't get it to work with a `pytest` based test suite. I'll write about that experience for documentation purposes; maybe it'll help someone.

I added it with `poetry add tortoise-orm[aiosqlite]`, using the extra `aiosqlite` option: since development will initially be against `SQLite`, the respective async driver needs to be present. We can then make a model with the following content.

```python
# nol/backend/db/models.py
from tortoise import fields
from tortoise.models import Model


class Entry(Model):
    # Defining the `id` field is optional, it will be defined automatically
    # if you haven't done it yourself
    id = fields.IntField(pk=True)
    title = fields.CharField(max_length=255)
    content = fields.TextField()

    def __str__(self) -> str:
        return f"'{self.title}'"
```

Now, let's make sure we can interact with the database via `Tortoise ORM` and the defined `Entry` model. To achieve this, we can create a simple roundtrip test using `pytest` as test framework and runner, following the documentation [instructions](https://tortoise.github.io/contrib/unittest.html#py-test), leading to a `conftest.py` file looking as follows:

```python
# tests/backend/conftest.py
import os

import pytest
from tortoise.contrib.test import finalizer, initializer


@pytest.fixture(scope="session", autouse=True)
def initialize_tests(request):
    db_url = os.environ.get("TORTOISE_TEST_DB", "sqlite://:memory:")
    initializer(["nol.backend.db.models"], db_url=db_url, app_label="models")
    request.addfinalizer(finalizer)
```

> Unfortunately, the available documentation is rather sparse around testing. Here I think `Piccolo` was better (at the time of writing).

The extremely simple test just creates an `Entry` object and attempts to save it into the in-memory `SQLite` database.

```python
# tests/backend/db/test_models.py
from nol.backend.db.models import Entry


async def test_entry_model():
    entry = Entry(title="Awesome post", content="An awesome post on a new project")
    await entry.save()
```

When executing `pytest`, the first hurdle shows up: the package `asynctest` is needed. When attempting to run the test after installing it, another issue arises, again around async requirements, causing the test to be skipped. This time the message says:

```
PytestUnhandledCoroutineWarning: async def functions are not natively supported and have been skipped.
You need to install a suitable plugin for your async framework, for example:
  - anyio
  - pytest-asyncio
  - pytest-tornasync
  - pytest-trio
  - pytest-twisted
```

Installing the first didn't help, nor did installing the second, or both. Time to hunt for an answer. According to [this](https://stackoverflow.com/questions/45410434/pytest-python-testing-with-asyncio) SO answer, the async test function needs to be decorated with `@pytest.mark.asyncio`. That helped, at least the test was executed, but then there was a conflict with the `finalizer` part of the `tortoise` setup, which never let `pytest` finish. Also, a lot of warnings were raised.

> Lesson: it might be that the async ecosystem in Python is still rather immature.

Now, I won't go on a wild chase.
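
For documentation purposes, the decorated attempt looked roughly like this (a sketch: it got the test executed, but the `finalizer` hang described above remained):

```python
# tests/backend/db/test_models.py -- the abandoned Tortoise ORM attempt, for reference
import pytest

from nol.backend.db.models import Entry


# with pytest-asyncio installed, this marker makes pytest actually run the coroutine
@pytest.mark.asyncio
async def test_entry_model():
    entry = Entry(title="Awesome post", content="An awesome post on a new project")
    await entry.save()
```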
It would have been great to continue with some synchronous operation mode within this ORM, but as far as I could see there is no such possibility with `Tortoise ORM`, whereas there is with `Piccolo`. Having to abandon `tortoise` is a shame, since its documentation shows how to use it alongside `pydantic` and `FastAPI`, which is what I intended to use.

## Back to what (should) work

Given the failure with an async-first ORM, I decided for the sake of moving on to go with plain `SQLAlchemy`, while still intending to try its async mode. Unfortunately, at the time of writing, `SQLAlchemy` 2.0 hasn't been released yet. This means using version 1.4.x, which is a transitional release with async elements already incorporated, as stated in the tutorial.

> There are two "alternatives" to `SQLAlchemy`: [`ormar`](https://collerek.github.io/ormar/) and [`SQLModel`](https://sqlmodel.tiangolo.com). Both are wrappers over `SQLAlchemy` that offer similar things: a modern feel, good integration between `pydantic` and `SQLAlchemy`, and less ceremony.

But first, I'll start with plain `SQLAlchemy`. The simplest `Entry` model imaginable could be:

```python
# nol/backend/db/models.py
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Entry(Base):
    __tablename__ = "entries"

    id = Column(Integer, primary_key=True)
    title = Column(String(30), nullable=False)
    content = Column(String)
    guid = Column(String(200), nullable=False)

    def __repr__(self):
        return f"Entry(id={self.id!r}, title={self.title!r})"
```

and we could create a simple test, with all the scaffolding for it, as follows:

```python
# tests/backend/conftest.py
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

from nol.backend.db.models import Base


@pytest.fixture(scope="session")
def connection():
    engine = create_engine("sqlite+pysqlite:///:memory:", future=True)
    Base.metadata.create_all(engine)
    return engine.connect()


@pytest.fixture
def db_session(connection):
    transaction = connection.begin()
    yield scoped_session(
        sessionmaker(autocommit=False, autoflush=False, bind=connection)
    )
    transaction.rollback()
```

This scaffolding could be done much better, but for now it is enough. And as for the simple test itself:

```python
# tests/backend/db/test_models.py
from nol.backend.db.models import Entry


def test_example(db_session):
    entries = [
        Entry(title="First", content="My first one"),
        Entry(title="Second", content="A second one"),
        Entry(title="Third", content="meg"),
    ]
    db_session.add_all(entries)
    db_session.commit()

    count = db_session.query(Entry).count()
    assert count == 3
```

Executing `pytest` gives us a passing test! Now that we can test, we can extend the models. To start with, two models are needed: one for the entries, as already shown in a simplified version, and one for the source of the entries, namely the feed.

### dataclasses

I'm a big fan of dataclasses, and to my surprise, `SQLAlchemy` does support [registering a `dataclass`](https://docs.sqlalchemy.org/en/14/orm/mapping_styles.html#declarative-mapping-with-dataclasses-and-attrs) as a model. So I was drawn to try it out, choosing the [declarative](https://docs.sqlalchemy.org/en/14/orm/mapping_styles.html#example-two-dataclasses-with-declarative-table) way.

```python
# nol/backend/db/models.py
from dataclasses import dataclass, field

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import registry

mapper_registry = registry()


@mapper_registry.mapped
@dataclass
class Entry:
    __tablename__ = "entries"
    __sa_dataclass_metadata_key__ = "sa"

    id: int = field(init=False, metadata={"sa": Column(Integer, primary_key=True)})
    title: str = field(metadata={"sa": Column(String(50), nullable=False)})
    content: str = field(metadata={"sa": Column(String(50), nullable=False)})
    guid: str = field(metadata={"sa": Column(String(200), nullable=False)})
```

By itself, we don't win much: basically, there's no need to define a `__repr__` or `__str__` to get a nice representation, and I'd expect comparison and equality of model instances to work as expected. At face value, it looks more verbose than the previous definition, and here is where something like `SQLModel` shines. Since I'm curious how far this dataclass approach can be taken, I'll continue with it. From here on it should be pretty straightforward.

## First full Entry

It's time for a more complete `Entry` model. It won't be the most complete yet; for now I'll keep out the relationship to a still undefined `Feed` model, and all kinds of interaction-related fields, such as whether an entry was read.

With the model in a more complete form, we have to start populating it from a feed. For this we'll use the `feedparser` library. When a feed is parsed, `feedparser` returns a list of entries. These entries are dictionary-like objects of the type `FeedParserDict`, which implements attribute-like access to the keys, making the user experience much better.

The first thing to do is to extract some values from this returned dictionary and put them into a new class representing a feed entry. Since some of the fields in the original parsed entries might also be of such dictionary-like type, it might also be interesting to extract and flatten that structure. Another reason to make a new class is to isolate and control what I'm propagating. Also, I don't want to be forced to deal with database-related fields right away. There is already one class representing a feed entry at the level of the database, but for now let's keep them separate. This might change in the future.

Staying within the standard library, we can use `dataclasses`. For now, I'll trust the returned feed and not do validations, and currently I'm not expecting to do JSON serialization. Once one or both of those assumptions change, it might be time to go for something like `attrs` or `pydantic`.

The parsed entries contain plenty of keys:

<table>
<tr>
<td>

```
author
author_detail
comments
content
contributors
created
created_parsed
enclosures
expired
expired_parsed
id
license
link
```

</td>
<td>

```
links
published
published_parsed
publisher
publisher_detail
source
summary
summary_detail
tags
title
title_detail
updated
updated_parsed
```

</td>
</tr>
</table>

To start with, not all of them are going to be considered. The most notable field left aside for the moment is `content`. As described in the documentation,

> entries[i].content
>
> A list of dictionaries with details about the full content of the entry.
>
> Atom feeds may contain multiple content elements. Clients should render as many of them as possible, based on the type and the client’s abilities.

which makes this particular key something that will need more attention.
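
To get a feeling for what `feedparser` returns before defining the new class, here is a quick sketch (the feed URL is just the example used later in the tests):

```python
import feedparser

# parse a feed and poke at the first entry; the available keys vary per feed
feed = feedparser.parse("https://www.theverge.com/rss/index.xml")
entry = feed.entries[0]

print(entry.title)       # attribute-style access on FeedParserDict
print(entry["title"])    # plain dict-style access works as well
print(sorted(entry.keys()))
```

The new class itself can then stay a plain `dataclass`: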

```python
# nol/backend/feedconsumer/parser.py
from __future__ import annotations

import datetime
from dataclasses import dataclass, field

import feedparser
from dateutil.parser import parse as date_parse


@dataclass
class ParsedEntry:
    title: str
    link: str
    author: str
    published: datetime.datetime
    summary: str = field(repr=False)
    guid: str = field(repr=False)

    @classmethod
    def from_feedparser_entry(cls, entry: feedparser.FeedParserDict) -> ParsedEntry:
        return cls(
            title=entry.title,
            link=entry.link,
            author=entry.author,
            published=date_parse(entry.published),
            summary=entry.summary,
            guid=entry.guid,
        )
```

> NOTE: Move to Python 3.10 and use the `kw_only` option for dataclasses to make them behave closer to `pydantic`?

Now that we know which fields we want to use, the `Entry` model needs to be adapted. Furthermore, a way to take a `ParsedEntry` and turn it into an `Entry` is needed.

```python
# nol/backend/db/models.py
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import registry

from nol.backend.feedconsumer.parser import ParsedEntry

mapper_registry = registry()


@mapper_registry.mapped
@dataclass
class Entry:
    __tablename__ = "entries"
    __sa_dataclass_metadata_key__ = "sa"

    id: int = field(init=False, metadata={"sa": Column(Integer, primary_key=True)})
    title: str = field(metadata={"sa": Column(String(50), nullable=False)})
    link: str = field(metadata={"sa": Column(String(200), nullable=False)})
    author: str = field(metadata={"sa": Column(String(50), nullable=False)})
    summary: str = field(repr=False, metadata={"sa": Column(String(3000))})
    # the guid identifies an entry across feed refreshes
    guid: str = field(repr=False, metadata={"sa": Column(String(200), nullable=False)})
    date_published: datetime = field(metadata={"sa": Column(DateTime, nullable=False)})
    date_added: datetime = field(metadata={"sa": Column(DateTime, nullable=False)})

    @classmethod
    def from_parsed_entry(cls, parsed_entry: ParsedEntry) -> Entry:
        return cls(
            title=parsed_entry.title,
            link=parsed_entry.link,
            author=parsed_entry.author,
            summary=parsed_entry.summary,
            guid=parsed_entry.guid,
            date_published=parsed_entry.published,
            date_added=datetime.now(),
        )
```

and

```python
# tests/backend/db/test_models.py
from datetime import datetime

from nol.backend.db.models import Entry
from nol.backend.feedconsumer.parser import ParsedEntry


def test_from_parsed_entry(db_session):
    parsed_entry = ParsedEntry(
        "Title", "a link", "Myself", datetime.now(), "A good day", "GUID!"
    )
    entry = Entry.from_parsed_entry(parsed_entry)

    db_session.add(entry)
    db_session.commit()

    retrieved_entry = db_session.query(Entry).filter_by(title="Title").first()
    assert retrieved_entry.id == 1
```

Now, taking a cue from the RealPython tutorial, I'm going to create a small task to populate the database. Starting with the most naive approach, some database-related functions are needed.

```python
# nol/backend/db/__init__.py
from sqlalchemy.engine import create_engine

from nol.backend.db.models import mapper_registry


def get_engine():
    engine = create_engine("sqlite+pysqlite:///:memory:", future=True)
    mapper_registry.metadata.create_all(engine)
    return engine
```

```python
# nol/backend/tasks/populate.py
import feedparser
from sqlalchemy.orm import Session

from nol.backend.db.models import Entry
from nol.backend.feedconsumer.parser import ParsedEntry


def populate(url: str, session: Session):
    feed = feedparser.parse(url)
    parsed_entries = (
        ParsedEntry.from_feedparser_entry(entry) for entry in feed.entries
    )
    entries = (Entry.from_parsed_entry(parsed_entry) for parsed_entry in parsed_entries)
    for entry in entries:
        # only add entries whose guid is not in the database yet
        if session.query(Entry).filter(Entry.guid == entry.guid).scalar() is None:
            session.add(entry)
    session.commit()
```

And let's add a test:

```python
# tests/backend/tasks/test_populate.py
from nol.backend.db.models import Entry
from nol.backend.tasks.populate import populate


def test_populate(db_session):
    populate("https://www.theverge.com/rss/index.xml", db_session)
    assert db_session.query(Entry).count() == 10


def test_populate_with_existing_entries(db_session):
    populate("https://www.theverge.com/rss/index.xml", db_session)
    assert db_session.query(Entry).count() == 10

    populate("https://www.theverge.com/rss/index.xml", db_session)
    assert db_session.query(Entry).count() == 10
```

The written test is basically an integration test, with all its downsides, but for now it serves its purpose. Also, the second test function has a potential, although unlikely, data race, so it could become flaky.

As with `pydantic`, all arguments have to be given by keyword. When I started working with `pydantic`, this was a behavior I found slightly irritating, but I understand and can appreciate the clarity and explicitness it brings.

Regarding the `future=True` flag used in `get_engine`, the SQLAlchemy 1.4 tutorial says:

> This tutorial describes a new API that’s released in SQLAlchemy 1.4 known as 2.0 style. The purpose of the 2.0-style API is to provide forwards compatibility with SQLAlchemy 2.0, which is planned as the next generation of SQLAlchemy.
>
> In order to provide the full 2.0 API, a new flag called future will be used, which will be seen as the tutorial describes the Engine and Session objects. These flags fully enable 2.0-compatibility mode and allow the code in the tutorial to proceed fully. When using the future flag with the create_engine() function, the object returned is a subclass of sqlalchemy.engine.Engine described as sqlalchemy.future.Engine. This tutorial will be referring to sqlalchemy.future.Engine.

Simple enough. Ideally, instead of storing information about the originating RSS feed on the entry itself, a reference to a row in a `Feeds` table should be used. But since it'll not be used for now, we're going to keep this simple.
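
To round things off, here is a hypothetical way to run the task by hand, wiring `get_engine()` and `populate()` together (just a sketch, not part of the project yet; the feed URL is the same example as in the tests):

```python
# hypothetical manual run; note the in-memory SQLite engine starts empty every time
from sqlalchemy.orm import Session

from nol.backend.db import get_engine
from nol.backend.db.models import Entry
from nol.backend.tasks.populate import populate

if __name__ == "__main__":
    engine = get_engine()
    with Session(engine) as session:
        populate("https://www.theverge.com/rss/index.xml", session)
        print(session.query(Entry).count(), "entries stored")
```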