Principal Engineer Test

# Search Engine Welcome to the Principal Engineer test project! Your task is to build a mini search engine specifically designed for searching programming documentation using existing technologies. - We are looking for displays of technical skill, efficient time use, and good judgement. - We grade relative to the aptitudes being displayed -- we don't expect all applicants to have domain knowledge of the project going into it. - Note that our backend stack uses crystal, python, javascript and rust. It is OK to use other languages, but preferred to align with our existing stack if possible. - Using AI to assist you in coding is fine. However, careless use of AI tools leading to sloppy code or solutions you don't fully understand will be disqualifying. - Feel free to ask questions by email as needed. # System architecture - Select and implement a crawler to crawl over the pages listed at the bottom of this text - The crawler should stay within the provided subdomain/domain/path restrictions. - If you find that proxy use would be needed to improve crawling, you can skip this and explain how you would architect a crawler with proxying. ### Index - Choose the database architecture of your choice for storing crawl results, indexing, and retrieving from indexed results. Please explain your decision in the project's README file. You can use a single database for everything, multiple databases, hand managed flat file pipelines, etc. as you wish. - While this project is on a small amount of data, assume this would be built to scale in the range of 1PB for the amount of data being indexed. - The maximum allowable search latency is 50ms. This should remain true if we scaled the DB to 1PB. Explain how you would scale your system, but you can implement a single-node version. - Bonus points for supporting both a classical index as well as an embedding based retrieval over results. # User Interface - The final search engine should be deployed at a publicly accessible URL. - Provide a simple interface for your search engine: an input box for the query and a display of the 10 most relevant search results. Each result should include the title, link, and a relevant snippet from the page. - Display the query latency above the results. - Include a basic page to show the number of indexed pages per domain, along with any other useful statistics you’d like to present. # Documentation 1. Include documentation allowing us to build your project from scratch by following it. Include a link to a cloud drive/bucket to download relevant data as necessary, so we don't have to re-run a full crawl to replicate results. 2. Explain architectural choices you've made. 3. Describe the challenges you encountered during crawling, indexing, and ranking, as well as the solutions you implemented. We're interested in that. If you wasted 3 days on a dead path, tell us about it and what you learned. 4. Explain how you optimized ranking to achieve high relevancy in the search results. 5. We want the documentation to be concise and human readable. If you use AI to generate a README for you, spend the time to edit it and make it easy for us to review. # Deliverables 1. A GitHub repository containing the project and accompanying documentation. 2. A live deployment of the search engine for testing. Prioritize both search relevancy and latency. Good luck! # Domains You can limit your crawling to the following domains: angular.io api.drupal.org api.haxe.org api.qunitjs.com babeljs.io backbonejs.org bazel.build bluebirdjs.com bower.io cfdocs.org clojure.org clojuredocs.org codecept.io codeception.com codeigniter.com coffeescript.org cran.r-project.org crystal-lang.org forum.crystal-lang.org css-tricks.com dart.dev dev.mysql.com developer.apple.com developer.mozilla.org developer.wordpress.org doc.deno.land doc.rust-lang.org docs.astro.build docs.aws.amazon.com docs.brew.sh docs.chef.io docs.cypress.io docs.influxdata.com docs.julialang.org docs.microsoft.com docs.npmjs.com docs.oracle.com docs.phalconphp.com docs.python.org docs.rs docs.ruby-lang.org docs.saltproject.io docs.wagtail.org doctrine-project.org docwiki.embarcadero.com eigen.tuxfamily.org elixir-lang.org elm-lang.org en.cppreference.com enzymejs.github.io erights.org erlang.org esbuild.github.io eslint.org expressjs.com fastapi.tiangolo.com flow.org fortran90.org fsharp.org getbootstrap.com getcomposer.org git-scm.com gnu.org gnucobol.sourceforge.io go.dev golang.org graphite.readthedocs.io groovy-lang.org gruntjs.com handlebarsjs.com haskell.org hex.pm hexdocs.pm httpd.apache.org i3wm.org jasmine.github.io javascript.info jekyllrb.com jsdoc.app julialang.org knockoutjs.com kotlinlang.org laravel.com latexref.xyz learn.microsoft.com lesscss.org love2d.org lua.org man7.org mariadb.com mochajs.org modernizr.com momentjs.com mongoosejs.com next.router.vuejs.org next.vuex.vuejs.org nginx.org nim-lang.org nixos.org nodejs.org npmjs.com ocaml.org odin-lang.org openjdk.java.net opentsdb.net perldoc.perl.org php.net playwright.dev pointclouds.org postgresql.org prettier.io pugjs.org pydata.org pytorch.org qt.io r-project.org react-bootstrap.github.io reactivex.io reactjs.org reactnative.dev reactrouterdotcom.fly.dev readthedocs.io readthedocs.org redis.io redux.js.org requirejs.org rethinkdb.com ruby-doc.org ruby-lang.org rust-lang.org rxjs.dev sass-lang.com scala-lang.org scikit-image.org scikit-learn.org spring.io sqlite.org stdlib.ponylang.io superuser.com svelte.dev swift.org tailwindcss.com twig.symfony.com typescriptlang.org underscorejs.org vitejs.dev vitest.dev vuejs.org vueuse.org webpack.js.org wiki.archlinux.org www.chaijs.com www.electronjs.org www.gnu.org www.hammerspoon.org www.khronos.org www.lua.org www.php.net/manual/en/ www.pygame.org www.rubydoc.info www.statsmodels.org www.tcl.tk www.terraform.io www.vagrantup.com www.yiiframework.com yarnpkg.com