Google and Bing now render JavaScript when crawling websites. However, a lot can go wrong in that process. The first step in debugging the problem is to sign up for Google Search Console, verify your site, and use the URL inspection tool. It can show a rendered screenshot of your site so that you can see whether Googlebot is actually seeing your content or not. Here is a list of the common things that can go wrong.

## Only Google and Bing are advanced enough to index JavaScript websites

Other search engines such as Yandex and Baidu are still not indexing client-side rendered websites, as far as I know. Since Google has a 90%+ share of the search market, this may not be a deal breaker for you. If you need your JS site to show up in the other search engines, you can use server-side rendering (SSR), which is described in a section below.

## Google takes months to index new websites

Google seems to be taking months to index any new website these days, regardless of whether it requires rendering. I'd expect an eight-month-old site to have at least some of its content indexed, but keep in mind that it could just be a matter of waiting longer.

## Websites that require rendering take longer

Googlebot has separate queues for regular crawling and rendering. It does a first pass to grab the server-supplied HTML, then it comes back later to do the rendering. Google has announced that the typical delay between the first crawl and rendering is now down to seconds. Despite that, websites that require rendering often seem to lag in indexing by days or weeks compared to pages that don't need to be rendered. See Rendering Queue: Google Needs 9X More Time To Crawl JS Than HTML | Onely.

## Assign each piece of content its own URL

When you are using a single-page-application (SPA) framework, it is tempting to use a single URL for your entire website. Doing so will kill your SEO. Google needs to be able to direct users to specific content deep within your site instead of sending all visitors to your home page. That means you need to assign each piece of content on your site its own URL. Google will only crawl and index content that has its own URL. If you have a true one-page site, Google will only ever index the content that is visible when the home page loads. Note that you can still use SPA frameworks; you just have to use pushState to change URLs for users without causing a full page reload from the server.

## Ensure that your web app loads and shows the correct content for any starting URL

Your web app needs to load for every URL on your site. The typical way of implementing this is to put a front-controller rule into .htaccess that causes index.html to be served regardless of which URL is requested (a sketch of such a rule follows this section). Your site also needs to show the correct content for that deep URL without navigating there from the home page. Googlebot will crawl your site by starting at every URL; if it doesn't get the content for that URL by visiting it first, your site won't get indexed. Additionally, users coming from Google need to see that deep content when they click through from the search results. Make sure the content is visible within a few seconds of the page loading, because Googlebot only allows the page to render for a limited time. All content needs to load for a URL without any user interaction: Googlebot doesn't simulate clicking, scrolling, or typing.
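As a rough sketch, a common mod_rewrite front-controller rule for the .htaccess approach mentioned above might look like the following. This assumes Apache with mod_rewrite enabled and an `index.html` entry point at the site root; adjust for your own setup:

```apache
# Serve index.html for any URL that doesn't map to a real file or directory
RewriteEngine On
RewriteBase /
# Leave requests for actual assets (JS bundles, CSS, images) alone
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Everything else falls through to the SPA entry point
RewriteRule . /index.html [L]
```

With a rule like this, a request for a deep URL such as `/products/widget` still returns your app shell, and it is then up to your JavaScript to show the right content for that URL.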
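To illustrate the pushState advice above together with the crawlable `<a href>` links covered in the next section, here is a minimal vanilla-JavaScript sketch of client-side routing. The `renderRoute` function, the `app` element ID, and the example routes are hypothetical placeholders, not something prescribed by this article:

```javascript
// Hypothetical view renderer: swaps the page content for a given path.
function renderRoute(path) {
  const routes = {
    '/': '<h1>Home</h1>',
    '/products/widget': '<h1>Widget</h1><p>Product details</p>',
  };
  document.getElementById('app').innerHTML =
    routes[path] || '<h1>404 Not Found</h1>';
}

// Intercept clicks on ordinary <a href> links so navigation stays client-side.
// Googlebot still sees the hrefs in the DOM and can discover the deep URLs.
document.addEventListener('click', (event) => {
  const link = event.target.closest('a[href^="/"]');
  if (!link) return;             // not an internal link
  event.preventDefault();        // stop the full page reload
  history.pushState({}, '', link.getAttribute('href')); // each piece of content keeps its own URL
  renderRoute(location.pathname);
});

// Keep the browser back/forward buttons working.
window.addEventListener('popstate', () => renderRoute(location.pathname));

// Render whatever URL the visitor (or Googlebot) started on.
renderRoute(location.pathname);
```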
## Render `<a href="...">` anchors for navigation

Googlebot only finds deep URLs in your site by scanning the document object model (DOM) for links. Googlebot doesn't click on anything, so you need to render `<a href="...">` links in your HTML to tell Googlebot about all the pages. When users click on these links, your JavaScript can intercept the clicks and load the desired content without reloading the whole page, as in the routing sketch above.

## Pay attention to 404 errors

If a URL on your site is requested that shouldn't have any content, you shouldn't serve default content for it; you should show a "404 Not Found" error. With an SPA this is harder to do than with server-side content. Common ways of simulating a 404 that Googlebot understands are to have JavaScript change the URL to an actual 404 page, or to simply render an error message that says "404 Not Found."
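Extending the hypothetical `renderRoute` sketch from earlier, here is one way to express both of those options for unknown URLs; the route table and messages are again made up for illustration:

```javascript
// Hypothetical 404 handling for a client-side route lookup.
function renderRoute(path) {
  const routes = {
    '/': '<h1>Home</h1>',
    '/products/widget': '<h1>Widget</h1>',
  };

  if (!(path in routes)) {
    // Option 1: move to a real /404 URL for which the server answers
    // with an actual "404 Not Found" status code.
    // window.location.replace('/404');

    // Option 2: render an explicit error message so Googlebot can see
    // that there is no content at this URL.
    document.getElementById('app').innerHTML =
      '<h1>404 Not Found</h1><p>This page does not exist.</p>';
    return;
  }

  document.getElementById('app').innerHTML = routes[path];
}
```

If you use the redirect option, make sure the server really does return a 404 status for that URL; otherwise Googlebot treats it as an ordinary page (a soft 404).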
## Consider using server-side rendering

Most client-side JavaScript frameworks have some way of rendering the initial page load server-side, usually by running Node.js on the server. When you implement this, search engine bots get a normal HTML and CSS page, which makes crawling and indexing much easier. Users get their first page view pre-rendered, then use client-side rendering for their subsequent page views.

## It could be a problem with your content, your link structure, or your reputation

If you have all the technical aspects of client-side rendering figured out, there are plenty of more basic reasons that Google chooses not to index content. See "Why aren't search engines indexing my content?" in the WHY section below.

# WHY

There are a number of reasons your content may not appear in search engine results. It is also important to note that a search engine's index may contain pages that it doesn't display in its results pages.

## How to tell if your content is actually indexed

It may actually be difficult to tell whether your content is indexed.

- Search for all the documents from your site and see how many are listed.
  - Google: enter `site:example.com` (where example.com is your domain; there must not be any space after the colon).
  - Bing: enter `site: example.com`
  - Yahoo: enter `site: example.com` (or use the advanced search form)
- Search for a specific document by choosing a unique sentence of eight to twelve words and searching for that sentence in quotes. For example, to find this document, you might search for "number of reasons your content may not appear in search engine results".
- In addition to the above, search for keywords using `inurl:` and `intitle:`. You might try something like `keyword with another keyword inurl:example.com`, which will bring up pages that are indexed only for the specified domain.
- Log into webmaster tools to see stats from the search engine itself about how many pages are indexed from the site.
  - Google Webmaster Tools: information is available under "Health" » "Index Status". If you have submitted sitemaps, you can also see how many documents in each sitemap file have been indexed.
  - Bing Webmaster Tools

In some cases, documents may not appear to be indexed via one of these methods but can be found in the index using other methods. For example, webmaster tools may report that few documents are indexed even when you can search for their sentences and find the documents in the search engine. In such a case, the documents are actually indexed.

## How content becomes indexed

Before search engines index content, they must find it using a web crawler. You should check your webserver's logs to see whether search engines' crawlers (identified by their user agent, e.g. Googlebot, Bing/MSNbot) are visiting your site. Larger search engines like Google and Bing typically crawl sites frequently, but the crawler may not know about a new site. You can notify search engines of the existence of your site by registering as its webmaster (Google Webmaster Tools, Bing Webmaster Tools) or, if the search engine does not provide this facility, by submitting a link to its crawlers (e.g. Yahoo).

## How long has your site/content been online?

Search engines may index content very quickly after it has been found, but these updates are occasionally delayed. Smaller search engines can also be much less responsive and take weeks to index new content. If your site hasn't been live for more than a few months, the search engines may not trust it enough to index much content from it yet.

## Do other websites link to your content?

If your content has only been online for several days and does not have any links from other sites (or its links come from sites which crawlers do not visit frequently), it is probably not indexed. You may be able to speed up indexing through white-hat techniques for attracting high-quality inbound links, such as linking to your content (when relevant) from your related social media accounts or blog, and creating content that is compelling enough that other websites naturally want to link to it.

## Has the content been excluded by the webmaster?

This step is especially important if you are taking over a site from someone else and there is an issue with a specific page or directory: check for robots.txt and META robots exclusions and remove them if you want crawlers to index the content being excluded (short examples of both appear at the end of this section).

## Is there a technical issue preventing your content from being indexed?

If you have an established site but specific content is not being indexed (there are no web crawler hits on the URLs where the content resides), the webmaster tools provided by Google and Bing may provide useful diagnostic information. Google's Crawl Errors documentation provides extensive background on common problems that prevent content from being indexed, and if you use Google Webmaster Tools, you will receive an alert if any of these issues are detected on your site. Correct errors and misconfigurations as quickly as possible to ensure that all of your site's content is indexed.

## Is the content low quality?

Search engines don't index most pages they crawl; they only index the highest quality content. Search engines will not index content if:

- It is spam, gibberish, or nonsense.
- It is found elsewhere. When search engines find duplicate content, they choose only one of the duplicates to index, usually the original that has more reputation and links.
- It is thin. It needs more than a couple of lines of original text, preferably much more. Automatically created pages with little content, such as a page for each of your users, are unlikely to get indexed.
- It doesn't have enough reputation or links. A page may be buried too deep in your site to rank. Any page without external links and more than a few clicks from the home page is unlikely to get indexed.

## Is some of your content indexed, but not all?

If your site has hundreds of pages, Google will almost never choose to index every single page. If your site has tens of thousands of pages, it is very common for Google to index only a small portion of them. Google chooses the number of pages to index from a site based on the site's overall reputation and the quality of the content. Google typically indexes a larger percentage of a site over time as the site's reputation grows.
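For reference on the exclusions mentioned under "Has the content been excluded by the webmaster?", these are the two mechanisms to look for. A robots.txt rule like the following blocks crawlers from an entire directory (the `/private/` path is just an example):

```
# robots.txt at the site root
User-agent: *
Disallow: /private/
```

and a `<meta name="robots" content="noindex">` tag in a page's `<head>` keeps that individual page out of the index even if it is crawled. Remove whichever of these applies if you want the excluded content indexed.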
# HOW

Googlebot has timeouts, but they're generally longer than 5 seconds. If a request takes too long, the bot will often abandon it and retry later. If it consistently times out or takes too long, Google will assume this is also a bad user experience and either ignore the content or rank it very poorly. Remember, page speed is becoming an increasingly important ranking factor.

I ran some tests to understand how Google Search handles a single-page application. I built the test website in Elm, but the same results should be valid for React, Angular, or any other language/framework.

## Findings Overview

- Googlebot runs the JavaScript on the page, and Ajax calls are properly executed.
- Googlebot waits between 5 and 20 seconds before taking a snapshot of each page.
- Fetches requested from the Search Console (I call these "T5") and the "natural" fetches done by Google (I call these "T20") are different: T5 takes a snapshot after around 5 seconds, T20 after around 20 seconds.
- Different sections of the page are snapshotted at different times. For example, in the T20 case the title always shows T19 and the meta description shows T20.
- There are mysterious situations where snapshots are taken in seemingly impossible states. For example, a snapshot is taken after 5 seconds but the page already shows the result of an Ajax call that arrived after 10 seconds.

## Use meaningful HTTP status codes

Googlebot uses HTTP status codes to find out if something went wrong when crawling the page. To tell Googlebot if a page can't be crawled or indexed, use a meaningful status code, like a 404 for a page that could not be found or a 401 for pages behind a login. You can also use HTTP status codes to tell Googlebot if a page has moved to a new URL, so that the index can be updated accordingly. Here's a list of HTTP status codes and how they affect Google Search.

# Appendix

![Screenshot 2023-12-29 at 2.03.25 PM](https://hackmd.io/_uploads/By4HZJ2DT.png)

![Screenshot 2023-12-29 at 2.03.16 PM](https://hackmd.io/_uploads/BJDrby2Dp.png)

![Screenshot 2023-12-29 at 2.17.47 PM](https://hackmd.io/_uploads/By1iV12D6.png)