# Understanding "Listings"
## How are listings stored?
Integrations are always 1-1 with the warehouse entities table via the `office_id` field.
We store the authentication for each listings integration in warehouse and pull down the specific listings that match the `realestate_feeds` authentications for that listings provider.
We currently store listings in tables defined per brand (this is not necessarily useful, but it has been the way we do things since 1901).
We can then filter each integration row's listings via custom params to get a subset of those listings, roughly as sketched below.
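For illustration only, here is a minimal Python sketch of that flow. Every name in it (`integration`, `pull_listings_from_warehouse`, `matches_custom_params`, `brand_tables`) is a hypothetical stand-in for the real sync code, not the actual implementation.
```python
# Minimal sketch of the sync flow described above. All names and data
# structures here are illustrative stand-ins, not the real implementation.

# An integration row on the 200 side: 1-1 with a warehouse entity via office_id,
# plus custom params used to narrow down the provider's feed.
integration = {
    "id": 42,
    "brand": "example_brand",
    "office_id": 1001,
    "provider": "constellation",
    "custom_params": {"filters": [{"field": "mls_name", "operator": "=", "value": "ires"}]},
}

def pull_listings_from_warehouse(office_id, provider):
    # Stand-in for warehouse: it holds the realestate_feeds authentication for
    # this office/provider and returns every listing that auth can see.
    return [
        {"id": "L1", "mls_name": "ires", "mls_office_id": "0FCPS"},
        {"id": "L2", "mls_name": "other", "mls_office_id": "XXXXX"},
    ]

def matches_custom_params(listing, params):
    # Simplified filter evaluation ("=" only); see the Listing Duplication
    # section below for a fuller sketch of the operators.
    return all(listing.get(f["field"]) == f["value"]
               for f in params.get("filters", []) if f["operator"] == "=")

# Listings end up in a table defined per brand on the 200 side.
brand_tables = {"example_brand": []}
feed = pull_listings_from_warehouse(integration["office_id"], integration["provider"])
brand_tables[integration["brand"]] = [
    l for l in feed if matches_custom_params(l, integration["custom_params"])
]
print(brand_tables)  # only the "ires" listing survives the filter
```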
## What challenges are there?
### Listing Duplication
Let's use constellation as an example.
We have a single constellation table for all listings. They are indexed/partitioned by `mls_office_id`.
We reference the warehouse entities table using the `office_id` field on the 200 integrations table.
(NOTE: Warehouse stores all listings authentication details inside warehouse db.)
In one use case, we have the following custom params across two pairs of constellation integrations:
```json
{
  "filters": [
    {
      "field": "mls_name",
      "operator": "=",
      "value": "ires"
    },
    {
      "field": "mls_office_id",
      "operator": "not in",
      "value": ["0FCPS", "LLEG", "0FSID", "0WSTB", "0TOWN"]
    }
  ]
}
{
  "filters": [
    {
      "field": "mls_name",
      "operator": "=",
      "value": "ires"
    },
    {
      "field": "mls_office_id",
      "operator": "in",
      "value": ["0FCPS", "LLEG", "0FSID", "0WSTB", "0TOWN"]
    }
  ]
}
```
```json
{
  "filters": [
    {
      "field": "mls_name",
      "operator": "=",
      "value": "ires"
    },
    {
      "field": "mls_office_id",
      "operator": "not in",
      "value": ["TGIDO", "9GRPN", "9GRPS", "1GRPG", "8GRPI", "8GRPC", "9GRPH"]
    }
  ]
}
{
  "filters": [
    {
      "field": "mls_name",
      "operator": "=",
      "value": "ires"
    },
    {
      "field": "mls_office_id",
      "operator": "in",
      "value": ["TGIDO", "9GRPN", "9GRPS", "1GRPG", "8GRPI", "8GRPC", "9GRPH"]
    }
  ]
}
```
You will notice that the only difference between these sets of integrations is the `in` vs `not in` operator. Essentially, this means that multiple orgs using the `not in` integration will duplicate the same listings into their org, because the filter evaluates to true for the same listings in each integration.
This causes confusion, uses more DB space, and leads to longer sync/wait times from warehouse to 200.
This could be resolved by abstracting a listings service and decoupling it from 200; however, it would mean that every time we grab listings we would need to obtain them via these complex params (for example `provider + mls_office_ids`), which would lose the performance benefit of having an `integration_id` index.
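To make the duplication concrete, here is a small Python sketch. The operator semantics ("=", "in", "not in" over listing fields) are my reading of the custom params above, not the actual evaluator. It shows how one listing satisfies the `not in` filter of two different orgs' integrations and therefore gets stored twice.
```python
# Sketch only: assumes the custom-param operators behave as "=", "in" and
# "not in" over listing fields; the real evaluator may differ.

def matches(listing, custom_params):
    for f in custom_params["filters"]:
        value = listing.get(f["field"])
        if f["operator"] == "=" and value != f["value"]:
            return False
        if f["operator"] == "in" and value not in f["value"]:
            return False
        if f["operator"] == "not in" and value in f["value"]:
            return False
    return True

# The two "not in" integrations from the blocks above (different orgs, different exclusion lists).
org_a_params = {"filters": [
    {"field": "mls_name", "operator": "=", "value": "ires"},
    {"field": "mls_office_id", "operator": "not in",
     "value": ["0FCPS", "LLEG", "0FSID", "0WSTB", "0TOWN"]},
]}
org_b_params = {"filters": [
    {"field": "mls_name", "operator": "=", "value": "ires"},
    {"field": "mls_office_id", "operator": "not in",
     "value": ["TGIDO", "9GRPN", "9GRPS", "1GRPG", "8GRPI", "8GRPC", "9GRPH"]},
]}

# Any ires listing whose office sits in neither exclusion list matches BOTH
# integrations, so both orgs pull their own copy of the same listing.
listing = {"mls_name": "ires", "mls_office_id": "0SOMEWHERE"}
print(matches(listing, org_a_params), matches(listing, org_b_params))  # True True
```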
### Building Property URLs
A lot of the time, property URLs are built by replacing a placeholder in a standard string with various attributes of the property listing, coupled with some custom logic that may or may not be tied to the specific integration.
We do this with custom params (haha) again. Here is an example of something that allows a specific org using this integration to format its property URLs in a unique way:
```json
{
  "replace": [
    {
      "field": "externallink",
      "find": "**",
      "replace": "property",
      "is_field": false,
      "conditions": [
        { "field": "type", "value": "residential" }
      ]
    },
    {
      "field": "externallink",
      "find": "**",
      "replace": "land",
      "is_field": false,
      "conditions": [
        { "field": "type", "value": "land" }
      ]
    },
    {
      "field": "sourceid",
      "find": "**",
      "replace": "alternate_id",
      "is_field": true
    },
    {
      "field": "alternate_id",
      "find": "**",
      "replace": "l",
      "is_field": false
    }
  ]
}
```
```json
{
  "replace": [
    {
      "field": "temp",
      "find": "**",
      "replace": "external_link_2",
      "is_source_field": true
    },
    {
      "field": "externallink",
      "find": "**",
      "replace": "temp",
      "is_field": true,
      "conditions": [
        { "field": "temp", "value": "is_not_empty" }
      ]
    },
    {
      "field": "externallink",
      "find": "**",
      "replace": "alternate_id",
      "is_field": true,
      "conditions": [
        { "field": "temp", "value": "is_empty" }
      ]
    },
    {
      "field": "externallink",
      "find": "**",
      "replace": "l3815833",
      "is_source_field": false,
      "conditions": [
        { "field": "sourceid", "value": "3273686" }
      ]
    },
    {
      "field": "externallink",
      "find": "**",
      "replace": "l3815429",
      "is_source_field": false,
      "conditions": [
        { "field": "sourceid", "value": "3273362" }
      ]
    }
  ],
  "concat": [
    {
      "field": "externallink",
      "concat": "l",
      "is_field": false,
      "type": "prepend",
      "conditions": [
        { "field": "temp", "value": "is_empty" },
        { "field": "listingtype", "value": "sale" }
      ]
    },
    {
      "field": "externallink",
      "concat": "r",
      "is_field": false,
      "type": "prepend",
      "conditions": [
        { "field": "temp", "value": "is_empty" },
        { "field": "listingtype", "value": "lease" }
      ]
    }
  ]
}
```
We should consider building a tool/interface for this, to inform the transformer how to build the listing URLs. This is also something that may not be able to live purely on the 200 side, as a lot of the information needed to build these URLs sits in warehouse only.
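As a sketch of how such a transformer could interpret the `replace`/`concat` macros above, here is one Python reading of them. The semantics (how `is_field`, `is_source_field`, `is_empty`/`is_not_empty` and conditions behave) are my interpretation of the examples, not the actual transformer.
```python
# Illustrative interpretation of the replace/concat custom params above.
# Field semantics are assumptions based on the examples, not the real transformer.

def condition_holds(listing, cond):
    value = listing.get(cond["field"])
    if cond["value"] == "is_empty":
        return not value
    if cond["value"] == "is_not_empty":
        return bool(value)
    return str(value) == str(cond["value"])

def apply_rules(listing, source, params):
    listing = dict(listing)
    for rule in params.get("replace", []):
        if not all(condition_holds(listing, c) for c in rule.get("conditions", [])):
            continue
        # Replacement is a literal, another listing field, or a source-record field.
        if rule.get("is_field"):
            replacement = str(listing.get(rule["replace"], ""))
        elif rule.get("is_source_field"):
            replacement = str(source.get(rule["replace"], ""))
        else:
            replacement = rule["replace"]
        current = str(listing.get(rule["field"], ""))
        listing[rule["field"]] = current.replace(rule["find"], replacement)
    for rule in params.get("concat", []):
        if not all(condition_holds(listing, c) for c in rule.get("conditions", [])):
            continue
        piece = str(listing.get(rule["concat"], "")) if rule.get("is_field") else rule["concat"]
        current = str(listing.get(rule["field"], ""))
        listing[rule["field"]] = piece + current if rule["type"] == "prepend" else current + piece
    return listing

# Tiny usage example against (part of) the first config block:
params = {"replace": [
    {"field": "externallink", "find": "**", "replace": "property", "is_field": False,
     "conditions": [{"field": "type", "value": "residential"}]},
    {"field": "sourceid", "find": "**", "replace": "alternate_id", "is_field": True},
]}
listing = {"externallink": "https://example.org/**/123", "sourceid": "**",
           "alternate_id": "A9", "type": "residential"}
print(apply_rules(listing, source={}, params=params))
```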
### Extended Data
For all fields that do not exist in the 200 listings DB, we store the values in a JSON array called `extended_data`. This is MOSTLY used for agent details and other display-only fields, but it can house literally anything. This is because we cannot add/update 200 DB fields due to the sheer size of the table.
As we are not using MySQL 8, we cannot query this JSON in a nice way, so it cannot be used to filter listings efficiently.
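That means anything stored in `extended_data` has to be filtered in application code after the rows are loaded. A minimal sketch of what that looks like, assuming `extended_data` is an array of name/value entries (the real shape may differ):
```python
import json

# Sketch only: assumes extended_data is a JSON array of name/value entries.
# Whatever the real shape is, the point stands: the filtering happens in code,
# not in a WHERE clause, because the DB cannot query the JSON efficiently.

rows = [
    {"id": 1, "extended_data": json.dumps([{"name": "agent_name", "value": "Jane Doe"}])},
    {"id": 2, "extended_data": json.dumps([{"name": "agent_name", "value": "John Roe"}])},
]

def extended_value(row, name):
    entries = json.loads(row["extended_data"] or "[]")
    return next((e["value"] for e in entries if e["name"] == name), None)

# Full scan plus a JSON decode per row: fine for display fields, hopeless as a listing filter.
matches = [r["id"] for r in rows if extended_value(r, "agent_name") == "Jane Doe"]
print(matches)  # [1]
```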
### Interpreters
We currently have classes within warehouse called interpreters that essentially act as custom definitions of provider transformations, allowing office-specific changes. These can range from changing a `property_url` structure to replacing the region and city fields.
These preceded the custom param replace macro language, which can now quite easily act as a substitute for these interpreters.
There may, however, be some things it cannot do. One such interpreter is `{"interpreter":"rex_leaddrop"}`.
In this case, we get agent emails like:
`381.14202@leaddrop.rexsoftware.com`. We then have to get a list of ALL agents for that office, match the first and last names of the record against those agents, and use the matching agent's real email address. It's things like this that scare me....
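A rough sketch of what the `rex_leaddrop` interpreter has to do. The agent-lookup call, field names, and sample data are placeholders; only the leaddrop address pattern and the name-matching step come from the description above.
```python
# Sketch of the rex_leaddrop email resolution described above.
# fetch_office_agents is a placeholder for however warehouse pulls the agent list.

def fetch_office_agents(office_id):
    return [
        {"first_name": "Jane", "last_name": "Doe", "email": "jane.doe@agency.example"},
        {"first_name": "John", "last_name": "Roe", "email": "john.roe@agency.example"},
    ]

def resolve_leaddrop_email(record, office_id):
    # The feed gives us something like 381.14202@leaddrop.rexsoftware.com,
    # which is useless for display, so we match on the record's name instead.
    if not record["email"].endswith("@leaddrop.rexsoftware.com"):
        return record["email"]
    for agent in fetch_office_agents(office_id):
        if (agent["first_name"].lower() == record["first_name"].lower()
                and agent["last_name"].lower() == record["last_name"].lower()):
            return agent["email"]
    return None  # no match: we have no real address for this agent

record = {"first_name": "Jane", "last_name": "Doe",
          "email": "381.14202@leaddrop.rexsoftware.com"}
print(resolve_leaddrop_email(record, office_id=1001))  # jane.doe@agency.example
```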
### External Endpoints are implemented poorly
Some endpoints do not return nicely paginated or sorted results, so we essentially have to manage the results locally based on the data returned, and we cannot expect the next batch to be in chronological order relative to the last. Basically, we need to keep fetching pages up to the maxpages header value or we will skip data.
This is bad if we are in the middle of a large sync and something goes wrong. We then need to resync the entire batch, because we have no idea whether the rest of the data was modified before or after what we already have. Basically, CRMs suck.
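In practice the fetch loop has to look something like the sketch below. The endpoint URL and page parameter are placeholders; the only thing taken from the note above is that a maxpages-style header tells us when to stop.
```python
import requests

# Sketch of the "fetch every page or skip data" behaviour described above.
# LISTINGS_URL and the page parameter are placeholders, not a real provider API.
LISTINGS_URL = "https://crm.example.com/api/listings"

def fetch_all_pages(session, auth_headers):
    results, page, max_pages = [], 1, 1
    while page <= max_pages:
        resp = session.get(LISTINGS_URL, headers=auth_headers,
                           params={"page": page}, timeout=30)
        resp.raise_for_status()
        # Results are not guaranteed to be chronologically ordered across pages,
        # so we cannot stop early at "everything newer than our last sync".
        results.extend(resp.json())
        max_pages = int(resp.headers.get("maxpages", max_pages))
        page += 1
    # If anything fails part-way through, the whole batch has to be restarted:
    # there is no way to know whether the missing pages were older or newer.
    return results
```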
### Image Expiry
We have had a lot of complications with images being synced from warehouse. One is that a specific CRM only has a 4-hour validity window on its image URLs, after which the images can no longer be downloaded. To fix this, we actually download the images in warehouse first so that we have a local S3 copy, and then send that to 200. This is shit, so shit, because 200 will then re-download and re-upload that image.
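The warehouse-side workaround looks roughly like this. The bucket name, key scheme, and the hand-off to 200 are placeholders; only the "download before the URL expires, keep an S3 copy" behaviour comes from the description above.
```python
import boto3
import requests

# Sketch of the image pre-download workaround: the CRM's image URLs expire after
# roughly 4 hours, so warehouse grabs them immediately and keeps its own S3 copy.
# Bucket/key naming and the 200 hand-off are placeholders.
s3 = boto3.client("s3")
BUCKET = "warehouse-listing-images"

def mirror_image(listing_id, image_url):
    resp = requests.get(image_url, timeout=30)
    resp.raise_for_status()
    key = f"listings/{listing_id}/{image_url.rsplit('/', 1)[-1]}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=resp.content,
                  ContentType=resp.headers.get("Content-Type", "image/jpeg"))
    # 200 is then pointed at the S3 copy... and proceeds to re-download and
    # re-upload it anyway, which is the wasteful part called out above.
    return f"https://{BUCKET}.s3.amazonaws.com/{key}"
```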