# HubGrep PrototypeFund demoweek EN

![hobgrebbit](https://i.imgur.com/q0oDpFt.png =250x)

## Searching for open source projects

Search through the open source world and you will quickly get the impression that almost all roads lead to GitHub. As a developer, you have likely used your browser's default search engine, hoping that what you're looking for is the top result - and when it's not, routinely sifted through irrelevant non-software results while looking for a project that fits your problem.

You can avoid this by searching on one of the open source platforms directly, but then you know for sure that you will not find projects hosted on other services - and this does not only mean the big ones either; there are multitudes of self-hosted platforms containing anywhere from only a handful to hundreds of repositories. We feel there's an improvement to be made in here somewhere.

Another issue - a partial one - is transparency regarding HOW items in your search results end up where they do. We want to make sure that results are ranked, ordered and presented in a transparent way. We say partial because some general search engines, as well as open source platforms, are already open source themselves, alleviating the issue. For those that are not open source, you can't be sure whether results are tailored for you based on your history in some form, whether someone paid to be on top, or anything else. Ideally, you should know that the tools you use cater to what you ask of them, and not to a potential external interest.

## Introducing HubGrep

![Search with HubGrep](https://i.imgur.com/d0yy3ih.gif)

[HubGrep](https://hubgrep.io/) is a dedicated search engine that focuses on giving you open source results, combining results from all open source platforms at once, without bias for large or small.

So how do we know "all" open source platforms? We assuredly don't, but if you know of one that isn't included you can tell us or [add it yourself directly on HubGrep](https://hubgrep.io/add_instance/step_1) and it will be included and available for everyone. The only delay is the time it takes our crawlers to cycle through the projects available on said platform.

In addition to matching software with your search phrase, we want to give you control via filters to narrow down results to your specific criteria; maybe it's important to you when a project was last updated, how old it is, or which language it uses. We want to present results based on what you ask for, not based on who we think you are or what you might like. As such, the only user data we may work with is a potential settings cookie, so that you have the option to persist a set of default filters for the next time you use HubGrep.

## Removing a click in your workflow

Switching to a new website to search for projects is a hard change to make to your workflow; visiting a second website is one more click compared to using your normal search engine. Luckily, most browsers give you the option to add custom search engines with a keyword, allowing you to search directly from the URL field without having to visit the landing page of said search engine.

![search with a keyword](https://i.imgur.com/oKDGzB5.gif)

For example, I can add https://hubgrep.io/?s=% and assign it to a keyword `h`, which lets me type `h my search phrase` in the URL field of my browser and be sent directly to the HubGrep results page for this search query.
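What the browser does with that keyword is simple: the `%` placeholder is replaced by your URL-encoded query. A minimal sketch of the resulting URL, assuming only that `s` is the search parameter (the helper function itself is made up for this post):

```python
from urllib.parse import urlencode

def hubgrep_search_url(query: str) -> str:
    """Build the URL a browser keyword like `h` would open.

    Mirrors the browser substituting the `%` placeholder
    in https://hubgrep.io/?s=% with the encoded query.
    """
    return "https://hubgrep.io/?" + urlencode({"s": query})

print(hubgrep_search_url("static site generator"))
# -> https://hubgrep.io/?s=static+site+generator
```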
Unfortunately this is not something we can automate for a user: you must set it up yourself, and the steps depend on your browser. Hopefully our [documentation](https://docs.hubgrep.io/en/latest/docs/hubgrep_search/usage/add_to_browser.html) can help you set this up for yourself.

## Producing a search-engine

You might be interested in how we went about building HubGrep, so here is an overview.

Our first approach was pretty simple and naive: we built a website with a search field, acting as a proxy towards services that we know of, such as GitHub. After sending your search query to each service, we collect the results, sort them by our own weights, and present a list to you (a rough sketch of this approach follows at the end of this section). With this, we had a first quick prototype, and it felt like a good thing to have a search engine like that. But just acting as a proxy has some drawbacks:

- Loading search results is as slow as the slowest hoster you are searching on, which becomes more unreliable and inconsistent as more hosters are included, since you have to wait for all of them before presenting results.
- You have to trust the hoster to give you the relevant results first, as you only get a subset for "the first page" of all results (or else loading is even slower, since you'll be sending more requests).
- The hoster has to actually provide a usable search API.
- We have to weigh and order results ourselves, and somehow merge them before presenting them.

If you want to try out what such a search engine looks like, you can do that [here](https://meta.hubgrep.io/)!

To move away from some of the above issues, our next step was to build our own search index, containing the meta-data associated with open source projects. We implemented [crawlers](https://github.com/HubGrep/hubgrep_crawlers), capable of going through all projects on a hosting service and collecting their meta-data. Our many crawlers communicate with our [indexer](https://github.com/HubGrep/hubgrep_indexer), which unifies and stores all the data. A crawling cycle eventually ends when we consider ourselves to have gone through all content from start to finish. At this point, we export the unified data to be used in the actual search engine. After this, the next cycle of crawling begins - rinse and repeat (the second sketch at the end of this section outlines this cycle).

So far we've implemented our crawlers and indexer to support collecting projects hosted on GitHub, GitLab, and Gitea platforms. There are [more services](https://en.wikipedia.org/wiki/Forge_(software)) out there of course - and not only those on the linked list. Adding them is a process that takes time, which we hope users will help us with as detailed above - as will implementing crawlers for new platform types which do not match any of the current three.

Another thing, as a pleasant side effect of our indexer: we also keep the "unprepared" [raw data](https://indexer.hubgrep.io/). So, if you have any nice ideas for projects, want to dataviz something up, or just want to poke around, feel free to use our [exports](https://hubgrep.io/hosters). We intend to improve our own API to include better functionality for data access as well.
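As promised above, here is a minimal sketch of the proxy approach from our first prototype - not our actual implementation. It fans out a query to two real public search APIs (GitHub's and gitlab.com's) and merges the results by a naive weight; everything beyond those two endpoints is made up for this post:

```python
import concurrent.futures
import requests  # third-party: pip install requests

# Each "hoster adapter" turns a query into (name, url, weight) rows.
# The endpoints and fields used here are the public GitHub/GitLab
# search APIs; treat them as illustrative, not as HubGrep's adapters.

def search_github(query):
    r = requests.get("https://api.github.com/search/repositories",
                     params={"q": query}, timeout=10)
    return [(repo["full_name"], repo["html_url"], repo["stargazers_count"])
            for repo in r.json().get("items", [])]

def search_gitlab(query):
    r = requests.get("https://gitlab.com/api/v4/projects",
                     params={"search": query}, timeout=10)
    return [(p["path_with_namespace"], p["web_url"], p["star_count"])
            for p in r.json()]

def proxy_search(query):
    hosters = [search_github, search_gitlab]
    results = []
    # Fan out to every hoster in parallel - but note the first drawback:
    # we still wait for the slowest (or timed-out) one before merging.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for rows in pool.map(lambda fn: fn(query), hosters):
            results.extend(rows)
    # Merge and order by our own (naive) weight: here, just star count.
    return sorted(results, key=lambda row: row[2], reverse=True)

if __name__ == "__main__":
    for name, url, stars in proxy_search("activitypub")[:10]:
        print(f"{stars:>6}  {name}  {url}")
```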
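And the second sketch: the crawl-unify-export cycle, reduced to a runnable skeleton. Every class and name below is invented for illustration; the real pieces live in the linked hubgrep_crawlers and hubgrep_indexer repositories:

```python
# Illustrative skeleton of one crawl -> unify -> export cycle.

class FakeHoster:
    """Stands in for a GitHub/GitLab/Gitea instance being crawled."""
    def __init__(self, name, projects):
        self.name = name
        self.projects = projects

    def iter_all_projects(self):
        # A real crawler would page through the hoster's API here.
        yield from self.projects

class Indexer:
    """Stands in for the service that unifies data from many crawlers."""
    def __init__(self):
        self.raw = []

    def store(self, hoster, metadata):
        self.raw.append({"hoster": hoster.name, **metadata})

    def unify_and_export(self):
        # A cycle is done once every hoster was covered start to finish;
        # the unified data is exported for the actual search engine.
        export, self.raw = self.raw, []
        return export

def run_one_cycle(hosters, indexer):
    for hoster in hosters:
        for metadata in hoster.iter_all_projects():
            indexer.store(hoster, metadata)
    return indexer.unify_and_export()

hosters = [FakeHoster("codeberg.org", [{"name": "foo", "language": "Rust"}]),
           FakeHoster("gitlab.com", [{"name": "bar", "language": "Python"}])]
print(run_one_cycle(hosters, Indexer()))
# after this, the next cycle of crawling begins - rinse and repeat
```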
## Where do we go from here

We have a working search engine now - but in a way, this is just the beginning.

Improving the weighting and ordering of results is our priority - without that effort, searches are likely to put the wrong items at the top, while technically still being valid matches. We also want to include public Bitbucket projects, which proved a challenge, hence we left this task for later.

Some APIs also provide more meta-data than others, so things like which license or which majority language is used are not known across all projects. A very large task would be to actually clone projects, analyse them, and then fill in the missing meta-data to consistently provide complete information - this will bump up hardware requirements. Additionally, we would then have all commit ids of the projects, so we could detect forks, mirrors and duplicates across all the hosting services (sketched below), providing a way to "unclutter" and group repeated content spread over multiple hosters.

In our minds, we are in a good position to give HubGrep more value than "just" indexing what the APIs provide, and "just" being a search engine. It could be a map of the open-source world!
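As a teaser for that last idea, here is a minimal sketch of how duplicate detection could work once commit ids are available - purely hypothetical, since we don't collect them yet. A repository's root commit id survives forks and mirrors, so grouping by it clusters copies of the same project across hosters:

```python
from collections import defaultdict

# Hypothetical input: (hoster, repo_name, root_commit_id) triples.
repos = [
    ("github.com",   "alice/widget",      "a1b2c3"),
    ("gitlab.com",   "alice/widget",      "a1b2c3"),  # mirror
    ("codeberg.org", "bob/widget-fork",   "a1b2c3"),  # fork
    ("github.com",   "carol/other-thing", "d4e5f6"),
]

# Group repositories by the commit they all started from.
groups = defaultdict(list)
for hoster, name, root_commit in repos:
    groups[root_commit].append(f"{hoster}/{name}")

for root_commit, members in groups.items():
    if len(members) > 1:
        print(f"same origin ({root_commit}):", ", ".join(members))
# -> same origin (a1b2c3): github.com/alice/widget,
#    gitlab.com/alice/widget, codeberg.org/bob/widget-fork
```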