# Integration of PyGEOS into Shapely: discussion document ## The Geometry classes Currently, Shapely provides a set of classes (Point, Linestring, Polygon, MultiPoint, etc), one for each of the basic geometry types. PyGEOS on the other hand has only a single Geometry extension type. So we need to be either fine with going with such a single class and provide some functionality to smooth the transition, or find workarounds to still have multiple classes. First, some reasons that we might like those different classes: - Different interfaces: currently, each of the classes can have it own set of methods. For example, Point has `x`, `y` and `z` attributes, while Polygon has `exterior` and `interiors` attributes (and some others). - But in the end, most of the methods they have in common, and it are only a few exceptions. Those can also raise an error if they are not applicable for the geometry type in question. - Educational: having those different Point, Linestring, Polygon objects that map to the actual geometry concepts can be nice when explaining this. - But, maybe that doesn't necessarily need different classes? We can still provide those `Point(..)` etc as constructors, and the class can also have just a good repr that depends on the geometry type? - Checking the geometry type with isinstance (eg `isinstance(geom, Polygon)`), compared to checking the geometry type (`geom.geom_type == 'Polygon'`), although this is more code style. - But also here, we can probably provide workarounds to keep such an isinstance-pattern working (eg, the `Polygon` can be a dummy class that returns a polgyon Geometry on construction and does the `geom_type` checking on isinstance call). More on that below. Now, exploring the different options **A single `Geometry` class** A hard break, and in Shapely 2.0 there is only a single class: the `Geometry` extension type. But, there are certainly ways to provide some backwards compatibility. For example, we can keep a few "factory" classes that provide compatible constructors and keep `isinstance` working. Basic (but working) example: ```python import pygeos class GeometryClass(type): def __instancecheck__(self, other): return pygeos.get_type_id(other) == self.type_id class Point(metaclass=GeometryClass): type_id = 0 def __new__(self, x, y): return pygeos.points(x, y) ``` would enable: ```python >>> point = Point(1, 2) >>> point <pygeos.Geometry POINT (1 2)> >>> isinstance(point, Point) True >>> poly = pygeos.box(0, 0, 1, 1) >>> isinstance(poly, Point) False ``` Another important aspect about the single Geometry extension type (defined in C) is to have the **familiar methods and attributes**. Since the method is defined in C, we also need to add such methods and attributes in C. Since coding in C is not the nicest thing, there might be ways to prevent needing this. One possible way to dynamically populate `__dir__` and define a `__getattr__` that dispatches to those via Python (calling python from C), as prototyped in https://github.com/pygeos/pygeos/pull/72. **Type-specific subclasses** If we want to keep the actual subclasses for the different geometry types (and not just the dummy factory classes for some backwards compatibility), there still might be some ways to achieve this: - Subclass in C: - This is possible. But, it would be nice to avoid putting too much special logic in C (to keep the amount of C code limited), so I would rather try to avoid this. Although it might be that this can be templated. - Subclass in Cython? (I would still need to check what advantages / disadvantages such an option has) - Subclass in Python: - This is also certainly possible (it's just a flag to enable to make the base extension type subclassable). - The vectorized functions will happily accept such objects as input (since they are subclasses), but vectorized functions that return new geometries as output won't return such geometries (but just the base class). Which would limit the end-user usability of such classes. - So if we want to go this way, we would need some way to (optionally) create those python subclasses from C to return them. Having the different subclasses would keep `isinstance` working as is. When subclassing in C, the same question about how to add the attributes and methods as for the single class is relevant. --- Personally, I think that a single class will be easier, and I have the feeling that many of the advantages of the multiple classes can also be emulated in the single class (eg have a geometry-type specific repr, still have geometry-type specific constructors, etc). So it might not be worth of doing the effort to create all the subclasses, if it is not needed from an implementation point of view. ## Mutability Currently, some of the geometry classes are mutable. Illustrative code: ```python >>> line = shapely.geometry.LineString([(0,0), (2, 2)]) >>> print(line) LINESTRING (0 0, 2 2) >>> line.coords = [(0, 0), (10, 0), (10, 10)] >>> print(line) LINESTRING (0 0, 10 0, 10 10) ``` If we remove the mutability, the geometries can also become hashable again (xref https://github.com/Toblerity/Shapely/issues/209). Currently, the pygeos.Geometry class is immutable by design. I would propose to keep it this way, and deprecate mutability in Shapely 1.8. ## Iteration / getitem of "Multi" geometries Currently, one can iterate through the "Multi" geometry types, but you get an error with non-multi types: ```python >>> from shapely.geometry import Point, MultiPoint >>> p = Point(1, 1) >>> mp = MultiPoint([(1, 1), (2, 2), (3, 3)]) >>> print(mp) MULTIPOINT (1 1, 2 2, 3 3) # using list(..) as way to iterate over it >>> list(p) ... TypeError: 'Point' object is not iterable >>> list(mp) [<shapely.geometry.point.Point at 0x7f98475a0d90>, <shapely.geometry.point.Point at 0x7f98475a0550>, <shapely.geometry.point.Point at 0x7f98475a0d30>] # getitem also works >>> mp[0] <shapely.geometry.point.Point at 0x7f98462cab50> ``` This can certainly be convenient in certain cases, but also creates an inconsistency between the different geometry types and it can also create some problems in other cases. For example, putting such scalars in a numpy object array: ```python # single point is fine >>> np.array([p], dtype=object) array([<shapely.geometry.point.Point object at 0x7f98476cb2e0>], dtype=object) # MultiPoint raises an error >>> np.array([mp], dtype=object) ... TypeError: object of type 'Point' has no len() # MultiPolygon gets splitted >>> from shapely.geometry import MultiPolygon, box >>> mpoly = MultiPolygon([box(0, 0, 1, 1), box(2, 2, 3, 3)]) >>> np.array([mpoly], dtype=object) array([[<shapely.geometry.polygon.Polygon object at 0x7f9846379ac0>, <shapely.geometry.polygon.Polygon object at 0x7f9846379670>]], dtype=object) ``` The "Multi" geometry types also have a `.geoms` attribute, which gives access to a list of geomtries to iterate over. So the question here can be: isn't this sufficient? Accessing this specific attribute doesn't seem problematic, as you need to know anyway that you have a "Multi" geomtry, as the others will raise an error on iter/getitem. ## Other aspects of the geometry class to check - ctypes attribute - array interface ? - ... ## Other discussion points - How to add the ufuncs to Shapely's API? This can be just in a submodule of shapely, eg `shapely.geos` ? (as the module that exposes the "raw" GEOS functionality on which the the other modules are based) - Integration with the other shapely submodules? (`ops`, `algorithms`, `affinity`, ...)