# Sitemap XML Comparison
Percentage of equal seo_ids between the demo sitemap from the new **Seo Feed Generator (SFG)**, built on a single day of job data, and the prod sitemap from **job-crawler-frontend (JCFE)**, built on 7 days of job data
### 89%
Percentage of equal seo_ids between 2 consecutive sitemaps from **JCFE**
### 96-98%
Percentage of equal seo_ids between 2 consecutive sitemaps from the new **SFG** service
### 97%
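The percentages on this page can be reproduced with a comparison along the following lines. This is only a minimal sketch, not the script that was actually used: it assumes that the seo_id is the last path segment of each `<loc>` URL and that the percentage is measured relative to the first (reference) sitemap. The function names `extract_seo_ids` and `percent_equal` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace from sitemaps.org
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_seo_ids(sitemap_xml: str) -> set:
    """Collect one seo_id per <loc> entry of a sitemap.

    ASSUMPTION: the seo_id is the last path segment of the URL.
    Adjust the extraction if the real URLs are built differently.
    """
    root = ET.fromstring(sitemap_xml)
    seo_ids = set()
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if url:
            seo_ids.add(url.rstrip("/").rsplit("/", 1)[-1])
    return seo_ids


def percent_equal(reference: set, candidate: set) -> float:
    """Percentage of reference seo_ids that also appear in candidate."""
    if not reference:
        return 0.0
    return 100.0 * len(reference & candidate) / len(reference)
```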
---------------------------
Percentage of equal seo_ids on a test job pool (wisestep) when **SFG** and **JCFE** are given exactly the same input XML
### 82-84%
after
- swapping the title in url from **predicted_title** to **cleaned_title**
- filtering **descriptions with fewer than 500** characters
- not dropping jobs w/o **predicted_title** or **city**
### 94%
after excluding **is_standard** and **is_remote** flags from **seo_id**s
### 97%
after implementing fallback to **cleaned_title** if there is no **predicted_title** in job-title-classifier
### 99.42%
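The title fallback from the last step could look roughly like this. It is a hedged sketch only: it assumes a job is a plain dict with optional `predicted_title` and `cleaned_title` keys, and the function name `title_with_fallback` is hypothetical, not the actual job-title-classifier code.

```python
from typing import Optional


def title_with_fallback(job: dict) -> Optional[str]:
    """Return predicted_title if present, otherwise fall back to cleaned_title.

    ASSUMPTION: missing, None, and whitespace-only values all count as
    "no title". Returns None when neither field is usable.
    """
    for field in ("predicted_title", "cleaned_title"):
        value = (job.get(field) or "").strip()
        if value:
            return value
    return None
```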
---------------------
Change in the percentage of equal seo_ids between the **SFG** sitemap (single day of job data) and the **JCFE** sitemap (7 days of job data) after the fixes
### 77% -> 89%
-------------
#### Possible reasons for the remaining difference:
- data from 1-day vs. data from 7 days
- cached data in prod systems (job-title-classifier)
--------------
(not so funny) FUN FACTS ABOUT SITEMAPS AND GOOGLE (based on interviews and comments from Google staff, including the Google Sitemap crawling team, and on ChatGPT)
- the whole existence of sitemaps is kinda optional (it is recommended if your website is new or if you have subpages not really accessible/crawlable from the main site)
- the reason why Google advises having sitemaps is to be able to crawl websites FASTER (it's not a necessity for crawling; if Google has already visited the site, it mainly uses its old cache or backlinks from other websites)
- sitemaps DO NOT affect SEO rank in any way
- changing the sitemaps dramatically DOES NOT involve any penalty (it is only requested and recommended that big changes on the website are also reflected in the sitemaps)
- the famous `<lastmod>` tag
  - is actually OPTIONAL
  - it is only used by Google to determine crawling frequency
  - some even recommend not using it, to keep sitemap sizes smaller
  - it's mostly ignored by Google crawlers as a result of misuse by website owners
  - it's definitely ignored if Google detects that the `lastmod` tag was changed without an actual content change (which is our case)
  - misuse of the tag does not involve a penalty, the crawlers just ignore it
## NEXT (POSSIBLE) STEPS
- **PLAN A** - if the information from one small test feed is not enough, we can save the elastic scrolling output from JCFE to S3 and create a test elastic index from the dump. I can then run JCFE on it with a single-day scrolling window to produce input similar to what SFG currently works with (it will be really time-consuming, and I am not sure it's worth it)
- **PLAN B** - leave it as it is; 89% vs. 96% is not that big a difference and it probably won't hurt us
- **PLAN C** - make the change from JCFE to SFG gradual (e.g. 10% of sitemaps every day or so), stretching the change over a longer time frame (we definitely need the env variables in nginx in this case)
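For PLAN C, one possible way to pick which sitemaps come from SFG is to hash each sitemap name into a stable bucket and compare it with a rollout percentage. The sketch below only illustrates that selection logic in Python; it is not nginx configuration. The env variable name `SFG_ROLLOUT_PERCENT` and the function names are assumptions made up for this example.

```python
import hashlib
import os


def bucket(sitemap_name: str) -> int:
    """Map a sitemap name to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(sitemap_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def serve_from_sfg(sitemap_name: str, rollout_percent: int) -> bool:
    """True if this sitemap should come from SFG at the given rollout level.

    A sitemap that switches to SFG at 10% stays on SFG at 20%, 30%, ...
    because its bucket never changes.
    """
    return bucket(sitemap_name) < rollout_percent


# HYPOTHETICAL env variable; raise it step by step (10, 20, ... 100).
rollout_percent = int(os.environ.get("SFG_ROLLOUT_PERCENT", "10"))
```

Because the bucket depends only on the sitemap name, raising the percentage never moves a sitemap back from SFG to JCFE, so each file changes source at most once during the rollout.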