# Project NoSQL - Elasticsearch

### BURY Maxime - ETENDARD Guillaume - TEA TOM

### Installation

You can run Elasticsearch on your own hardware or use the hosted Elasticsearch Service on Elastic Cloud. For this project we use the Elastic Cloud version, but installing it on your own hardware is straightforward. You just have to download the archive, verify its checksum, and extract it:

```
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz.sha512
shasum -a 512 -c elasticsearch-7.10.1-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
cd elasticsearch-7.10.1/
```

and run the program inside:

```
./bin/elasticsearch
```

### I - Introduction

Elasticsearch is a distributed, open source search and analytics engine for all types of data. It is well known for its speed and scalability. Elasticsearch can index many types of content, and it is used for numerous use cases such as application search, enterprise search, application performance monitoring, and security analytics, to name a few (paraphrased from Elastic).

It has several areas of use: machine learning, security, and data analytics on datasets, for example.

### II - Properties, Strengths & Limitations

### Properties

- With one query, you can combine different kinds of searches and data types, including structured, unstructured, geo, and metrics.
- It can analyze billions of records in a few seconds.
- It provides aggregations, which can be used to explore trends and patterns in the data.
- Elasticsearch provides support for various languages, including Java, Python, PHP, JavaScript, Node.js, Ruby, and many more.

### Strengths

#### Sharding

We know that Elasticsearch is extremely scalable: it can adapt to support an increasing amount of data or growing demands placed on it. Sharding is one of the practices that give Elasticsearch its scalability.
A shard is the unit at which Elasticsearch distributes data around the cluster. The speed at which shards move around when data is rebalanced depends on the size and number of shards, as well as on network and disk performance.

As an illustration, suppose we have an index of movie documents containing about 1000 gigabytes of data, and our cluster has two nodes, each with 500 gigabytes available for storing data. Neither node can hold the whole index, so we shard it: we divide the index into smaller pieces called shards. A shard contains part of the index's data and can be distributed across the nodes of the cluster.

#### Replication

Replication creates copies of shards and keeps the copies on different nodes. If a node goes down, the copies stored on the other nodes step in and serve requests. Elasticsearch replicates shards automatically, without any extra request or configuration.

#### Rapidity

Elasticsearch is really fast. Its speed comes from inverted indices with finite state transducers for full-text search, BKD trees for storing numeric and geographic data, and a column store for analytics. The time saved can be spent running more precise searches on the datasets.

#### Relevance

You can rank your search results according to several factors such as date, frequency, popularity, and more. Moreover, it compensates for typing errors made by the user.

#### Diversity of usage

Elasticsearch comes with many tools of different kinds. You can use Elastic Metrics to visualize indicators alongside the logs collected with Elastic Logs. You can check performance with APM, run a centralized search with Workplace Search to break down silos in company data, monitor and resolve availability problems with Uptime, or even prevent, detect, and hunt down threats with Endpoint Security.
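
To make the sharding and replication ideas from this section more concrete, here is a minimal Python sketch. It is not Elasticsearch's actual implementation: it assumes a simplified routing rule (a stable hash of the document id modulo the number of primary shards) and a naive placement in which each shard has a single replica stored on the node after its primary. The function names `shard_for` and `place_shards`, the node names, and the use of `zlib.crc32` as the hash are all illustrative assumptions.

```python
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Pick a shard number from a stable hash of the document id."""
    # crc32 is a stand-in hash chosen for this sketch only.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def place_shards(num_primary_shards: int, nodes: list) -> dict:
    """Place primaries round-robin and each replica on the following node."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate a replica from its primary")
    layout = {}
    for shard in range(num_primary_shards):
        primary = nodes[shard % len(nodes)]
        replica = nodes[(shard + 1) % len(nodes)]
        layout[shard] = {"primary": primary, "replica": replica}
    return layout


# Hypothetical two-node cluster holding an index split into four shards.
layout = place_shards(4, ["node-1", "node-2"])
shard = shard_for("8BIFW73", 4)
print(shard, layout[shard])
```

With two nodes, every shard ends up with its primary on one node and its replica on the other, which is why losing a single node still leaves a full copy of the data available.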

### Limitations

#### System requirements

Elasticsearch works most efficiently on a group of servers with around 64 GB of RAM each. Using too many small servers creates overhead, while relying on only a few powerful servers leaves the cluster more exposed when one of them fails. Moreover, queries run faster if the data is stored on SSDs rather than rotating disks. However, SSDs are more expensive, which drives up the cost of the infrastructure.

#### Security

A self-managed Elasticsearch installation is not secured out of the box. You have to enable and configure security yourself, or install third-party plugins.

#### It's a search engine, not a database

It is not recommended to use Elasticsearch as the primary data store.

#### Lack of ACID

An "ACID-compliant data store" (Atomicity, Consistency, Isolation, Durability) is highly resistant to corruption and data loss. Elasticsearch is not one: data loss can happen in a number of ways, so you need to be able to recreate the data if needed. Updates made between the most recent snapshot and an outage will be lost unless you have another system in place to queue them.

#### No Transaction

Elasticsearch has no transactions, which means there is no rollback facility. Each operation is atomic on its own, and there is no way to cancel, abort, or revert it. Moreover, if you are doing parallel writes, the system offers no standard locking mechanism.

### III - Use Cases

#### Full Text Search

Elasticsearch is one of the most powerful tools for text search. Its handling of fuzzy searches and its autocomplete functionality make it very effective. Websites that have switched to Elasticsearch have seen a big difference.

#### Logging and Analysis

Elasticsearch can be used to store and analyze large volumes of log data. With the Elastic Stack, you can index and analyze millions of logs per day.

#### Scraping Data

Elasticsearch makes it easy to work with scraped data. You can scrape data from different sources while keeping it all searchable, and then analyze large volumes of scraped data.

#### Metrics

Some of the biggest companies use the Elastic Stack to run through billions of documents per day, analyzing them to get metrics and important KPIs.

### IV - Application

The syntax of a query is very simple. You start the request by specifying an HTTP method (GET, POST, PUT, DELETE). Then you specify the API you want to access and what you would like to accomplish (the command).

#### Dataset

We used the Kibana sample flight data (index kibana_sample_data_flights) provided by Kibana in Elastic Cloud:

- FlightNum: flight number
- DestCountry: destination country
- OriginWeather: weather at the origin
- OriginCityName: origin city
- AvgTicketPrice: average ticket price
- DistanceMiles: distance in miles
- FlightDelay: whether the flight is delayed
- DestWeather: weather at the destination
- Dest: destination airport
- FlightDelayType: type of flight delay
- OriginCountry: origin country
- dayOfWeek: day of the week (as a number)
- DistanceKilometers: distance in kilometers
- timestamp: date (month day, year, hour)
- DestLocation: destination location as a geopoint { "lat", "lon" }
- DestAirportID: ID of the destination airport
- Carrier: airline company
- Cancelled: whether the flight is cancelled
- FlightTimeMin: duration of the flight in minutes
- Origin: origin airport of the flight

With Kibana, the visualization tool of the Elastic Stack, we can get an overview of the dataset:

![](https://i.imgur.com/6VfpEWF.png)
<center>This is the data for the last 24 hours</center>

#### List of nodes

We can get a list of the nodes in our cluster. To get this information, we use the _cat API: we send a GET request to the _cat API with the nodes?v command.

    GET /_cat/nodes?v

The response includes each node's IP address, roles, and name, as well as some performance measures.
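
The response of `GET /_cat/nodes?v` is a plain-text table: a header row followed by one row per node. As a rough illustration of how such a response could be consumed from a script, here is a small Python sketch. The helper name `parse_cat_table` and the sample text are assumptions made up for this example (the values are not taken from our cluster), and the parser assumes that no cell contains spaces.

```python
def parse_cat_table(text: str) -> list:
    """Turn a whitespace-separated _cat table into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Assumes every cell is a single token with no embedded spaces.
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Made-up sample shaped like a `GET /_cat/nodes?v` response.
SAMPLE = """\
ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.0.1            42          71   3    0.12    0.10     0.08 dim       *      node-1
10.0.0.2            35          64   1    0.05    0.07     0.06 dim       -      node-2
"""

nodes = parse_cat_table(SAMPLE)
# The elected master is marked with "*" in the master column.
master = [node["name"] for node in nodes if node["master"] == "*"]
print(master)
```

The _cat APIs also accept a `format=json` parameter, which avoids this kind of text parsing when the output is meant for a program rather than a person.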

#### Queries

- Match All: returns all the results

      GET /kibana_sample_data_flights/_search
      {
        "query": {
          "match_all": {}
        }
      }

- Match query: accepts text, numbers, or dates, analyzes the input, and builds a query from it

      GET /kibana_sample_data_flights/_search
      {
        "query": {
          "match": { "OriginCityName": "Tokyo" }
        }
      }

  Here we match all documents whose origin city is Tokyo. Each hit comes with a score that reflects its relevance.

- Query string: a query that uses a query parser to parse its content

      GET /kibana_sample_data_flights/_search
      {
        "query": {
          "query_string": {
            "query": "Moscow"
          }
        }
      }

- Here we specify that the field to search is "OriginCityName"

      GET /kibana_sample_data_flights/_search
      {
        "query": {
          "query_string": {
            "query": "Moscow",
            "fields": ["OriginCityName"]
          }
        }
      }

- Setting values in a document. The request goes to the _doc endpoint rather than _search, which only accepts GET and POST; `<document_id>` is a placeholder for the _id of the flight to overwrite. A PUT replaces the whole document, so the body lists every field we want to keep.

      PUT /kibana_sample_data_flights/_doc/<document_id>
      {
        "FlightNum": "8JY0ER1",
        "DestCountry": "JP",
        "OriginWeather": "Rain",
        "OriginCityName": "Moscow",
        "AvgTicketPrice": 867.6342067834399,
        "DistanceMiles": 5460.667255132447,
        "FlightDelay": true,
        "DestWeather": "Rain",
        "Dest": "Wichita Mid Continent Airport",
        "FlightDelayType": "NAS Delay",
        "OriginCountry": "CN",
        "dayOfWeek": 3,
        "DistanceKilometers": 8788.092083043874,
        "timestamp": "2020-12-17T04:40:26",
        "DestLocation": {
          "lat": "37.64989853",
          "lon": "-97.43309784"
        }
      }

  Here, we change DestCountry (from US to JP) and OriginCountry (from RU to CN).

- Multiple conditions, as in a WHERE clause: we require DestCountry == US and OriginCountry == RU

      GET /kibana_sample_data_flights/_search
      {
        "query": {
          "query_string": {
            "query": "DestCountry:US AND OriginCountry:RU"
          }
        }
      }

### Fuzziness

We can search for terms that are similar to, but not exactly like, our search terms by using the "fuzzy" operator ~. For example, Moscw~ in a query string can still match Moscow.

### Proximity Searches

While a phrase query expects all of the terms in exactly the same order, a proximity query allows the specified words to be further apart or in a different order.
In the same way that fuzzy queries can specify a maximum edit distance for characters in a word, a proximity search allows us to specify a maximum edit distance of words in a phrase. Appending ~N to the quoted phrase (for example "Zurich Airport"~2) sets that maximum distance; without it, the query below is a plain phrase search.

- Example:

      GET /kibana_sample_data_flights/_search
      {
        "query": {
          "query_string": {
            "query": "\"Zurich Airport\""
          }
        }
      }

### Filters

Filters lend themselves to caching. Caching the result of a filter does not require a lot of memory and makes other queries that run against the same filter very fast. Elasticsearch can cache filters such as term, terms, prefix, and range automatically, and they are recommended (over the equivalent query version) when the same filter, with the same parameters, will be used across several different queries.

### Geo Distance Filter

Keeps only the hits that lie within a specific distance from a geo point, which is very useful for geolocation applications.

- We want to find all flights whose origin is within 200 km of a specific geopoint (37.46910095, 126.4509964)

      GET /kibana_sample_data_flights/_search
      {
        "query": {
          "bool": {
            "must": {
              "match_all": {}
            },
            "filter": {
              "geo_distance": {
                "distance": "200km",
                "OriginLocation": {
                  "lat": "37.46910095",
                  "lon": "126.4509964"
                }
              }
            }
          }
        }
      }

### Range filter

Filters documents with fields that have terms within a certain range. It is similar to the range query, except that it acts as a filter. It can be placed within queries that accept a filter.

- lte: less than or equal to
- gte: greater than or equal to

      GET /kibana_sample_data_flights/_search
      {
        "query": {
          "bool": {
            "must": {
              "query_string": {
                "query": "Moscow"
              }
            },
            "filter": {
              "range": {
                "AvgTicketPrice": {
                  "lte": 500
                }
              }
            }
          }
        }
      }

Here we look for all Moscow flights with an average ticket price of at most 500.

### Match Phrase (specified the order)

We look for the OriginWeather value whose words appear in exactly this order: "Thunder & Lightning".

    GET /kibana_sample_data_flights/_search
    {
      "query": {
        "match_phrase": {
          "OriginWeather": "Thunder & Lightning"
        }
      }
    }

The same applies to the destination, with the words of "Incheon International Airport" in that order:

    GET /kibana_sample_data_flights/_search
    {
      "query": {
        "match_phrase": {
          "Dest": "Incheon International Airport"
        }
      }
    }

### Bool Query

A request based on several criteria. A must clause has to match for a document to be returned; a should clause is optional here, but matching it raises the score.

    GET /kibana_sample_data_flights/_search
    {
      "query": {
        "bool": {
          "must": [
            { "query_string": { "query": "Incheon International Airport" } }
          ],
          "should": [
            { "match_phrase": { "Dest": "Incheon International Airport" } }
          ]
        }
      }
    }

### Example

We look for all flights from Seoul to Manchester:

    GET /kibana_sample_data_flights/_search
    {
      "query": {
        "query_string": {
          "query": "OriginCityName:Seoul AND DestCityName:Manchester"
        }
      }
    }

We got 4 results for that query, which took 8 ms:

    {
      "took" : 8,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 4,
          "relation" : "eq"
        },
        "max_score" : 8.559007,
        "hits" : [
          {
            "_index" : "kibana_sample_data_flights",
            "_type" : "_doc",
            "_id" : "UDGxCHYBc6dI43sjjML1",
            "_score" : 8.559007,
            "_source" : {
              "FlightNum" : "8BIFW73",
              "DestCountry" : "GB",
              "OriginWeather" : "Rain",
              "OriginCityName" : "Seoul",
              "AvgTicketPrice" : 934.3075430000141,
              "DistanceMiles" : 5468.812762075307,
              "FlightDelay" : false,
              "DestWeather" : "Sunny",
              "Dest" : "Manchester Airport",
              "FlightDelayType" : "No Delay",
              "OriginCountry" : "KR",
              "dayOfWeek" : 2,
              "DistanceKilometers" : 8801.201005769322,
              "timestamp" : "2020-12-02T22:29:45",
              "DestLocation" : { "lat" : "53.35369873", "lon" : "-2.274950027" },
              "DestAirportID" : "MAN",
              "Carrier" : "JetBeats",
              "Cancelled" : false,
              "FlightTimeMin" : 517.7177062217248,
              "Origin" : "Incheon International Airport",
              "OriginLocation" : { "lat" : "37.46910095", "lon" : "126.4509964" },
              "DestRegion" : "GB-ENG",
              "OriginAirportID" : "ICN",
              "OriginRegion" : "SE-BD",
              "DestCityName" : "Manchester",
              "FlightTimeHour" : 8.628628437028746,
              "FlightDelayMin" : 0
            }
          },
          {
            "_index" : "kibana_sample_data_flights",
            "_type" : "_doc",
            "_id" : "xE2xCHYBQTOhRJ3qkM9y",
            "_score" : 8.559007,
            "_source" : {
              "FlightNum" : "Y09SG4R",
              "DestCountry" : "GB",
              "OriginWeather" : "Heavy Fog",
              "OriginCityName" : "Seoul",
              "AvgTicketPrice" : 1105.7271555970426,
              "DistanceMiles" : 5468.812762075307,
              "FlightDelay" : false,
              "DestWeather" : "Cloudy",
              "Dest" : "Manchester Airport",
              "FlightDelayType" : "No Delay",
              "OriginCountry" : "KR",
              "dayOfWeek" : 5,
              "DistanceKilometers" : 8801.201005769322,
              "timestamp" : "2020-12-05T01:04:12",
              "DestLocation" : { "lat" : "53.35369873", "lon" : "-2.274950027" },
              "DestAirportID" : "MAN",
              "Carrier" : "Logstash Airways",
              "Cancelled" : false,
              "FlightTimeMin" : 440.06005028846613,
              "Origin" : "Incheon International Airport",
              "OriginLocation" : { "lat" : "37.46910095", "lon" : "126.4509964" },
              "DestRegion" : "GB-ENG",
              "OriginAirportID" : "ICN",
              "OriginRegion" : "SE-BD",
              "DestCityName" : "Manchester",
              "FlightTimeHour" : 7.334334171474436,
              "FlightDelayMin" : 0
            }
          },
          {
            "_index" : "kibana_sample_data_flights",
            "_type" : "_doc",
            "_id" : "BTGxCHYBc6dI43sjdbdm",
            "_score" : 8.559007,
            "_source" : {
              "FlightNum" : "8PG5RJU",
              "DestCountry" : "GB",
              "OriginWeather" : "Rain",
              "OriginCityName" : "Seoul",
              "AvgTicketPrice" : 939.8822405873354,
              "DistanceMiles" : 5468.812762075307,
              "FlightDelay" : false,
              "DestWeather" : "Clear",
              "Dest" : "Manchester Airport",
              "FlightDelayType" : "No Delay",
              "OriginCountry" : "KR",
              "dayOfWeek" : 2,
              "DistanceKilometers" : 8801.201005769322,
              "timestamp" : "2020-11-18T04:50:56",
              "DestLocation" : { "lat" : "53.35369873", "lon" : "-2.274950027" },
              "DestAirportID" : "MAN",
              "Carrier" : "Logstash Airways",
              "Cancelled" : false,
              "FlightTimeMin" : 488.955611431629,
              "Origin" : "Incheon International Airport",
              "OriginLocation" : { "lat" : "37.46910095", "lon" : "126.4509964" },
              "DestRegion" : "GB-ENG",
              "OriginAirportID" : "ICN",
              "OriginRegion" : "SE-BD",
              "DestCityName" : "Manchester",
              "FlightTimeHour" : 8.14926019052715,
              "FlightDelayMin" : 0
            }
          },
          {
            "_index" : "kibana_sample_data_flights",
            "_type" : "_doc",
            "_id" : "gE2xCHYBQTOhRJ3qpebz",
            "_score" : 8.559007,
            "_source" : {
              "FlightNum" : "6J0PQVN",
              "DestCountry" : "GB",
              "OriginWeather" : "Hail",
              "OriginCityName" : "Seoul",
              "AvgTicketPrice" : 697.2083968654315,
              "DistanceMiles" : 5472.234893776964,
              "FlightDelay" : false,
              "DestWeather" : "Cloudy",
              "Dest" : "Manchester Airport",
              "FlightDelayType" : "No Delay",
              "OriginCountry" : "KR",
              "dayOfWeek" : 6,
              "DistanceKilometers" : 8806.708392890594,
              "timestamp" : "2020-12-27T08:13:59",
              "DestLocation" : { "lat" : "53.35369873", "lon" : "-2.274950027" },
              "DestAirportID" : "MAN",
              "Carrier" : "Logstash Airways",
              "Cancelled" : false,
              "FlightTimeMin" : 463.51096804687336,
              "Origin" : "Gimpo International Airport",
              "OriginLocation" : { "lat" : "37.5583", "lon" : "126.791" },
              "DestRegion" : "GB-ENG",
              "OriginAirportID" : "GMP",
              "OriginRegion" : "SE-BD",
              "DestCityName" : "Manchester",
              "FlightTimeHour" : 7.725182800781223,
              "FlightDelayMin" : 0
            }
          }
        ]
      }
    }

### Performance

#### Query Profiler for index

##### Example of query string

```
{
  "query": {
    "query_string": {
      "query": "OriginCityName:Seoul AND DestCityName:Manchester"
    }
  }
}
```

![](https://i.imgur.com/vfZGhJZ.png)

##### Example of match query

```
GET /kibana_sample_data_flights/_search
{
  "query": {
    "match": { "OriginCityName": "Tokyo" }
  }
}
```

![](https://i.imgur.com/WSveYKb.png)

#### Queries Time

A query for all flights in the last 7 days, with 2189 results, takes 175 ms.

![](https://i.imgur.com/vq67Gxc.png)

Increasing the number of partitions, a query for all flights in the last 30 days, with 4107 results, takes 188 ms.

![](https://i.imgur.com/k73YyQA.png)

With a bool query for all cancelled flights:

![](https://i.imgur.com/wCRaaUf.png)

All flights with a distance of less than 200 km:

![](https://i.imgur.com/gypyImR.png)

### Conclusion

Elasticsearch is an open source, distributed engine that gives the user the possibility to do
many things, such as application search, application performance monitoring, or even security checks. It has many strengths: it can analyze a huge number of records very quickly, and it is extremely scalable, which lets it adapt well to growing amounts of data. Finally, it lets users rank search results as they wish and compensates for the user's typing errors.

However, like other tools, Elasticsearch has flaws. A self-managed installation is not secured out of the box: if you want to secure your work, you have to configure security yourself or install a third-party plugin. It needs big servers to work efficiently, around 64 GB of RAM each. Otherwise, it creates overhead or leaves the cluster exposed when a server fails. Moreover, it is vulnerable to corruption and data loss. Finally, it has no transactions, which means there is no rollback facility and no operation can be cancelled, aborted, or reverted.

In conclusion, depending on your needs, Elasticsearch can be a very good tool. If you need a tool for search or analysis, Elasticsearch is the tool for you. It has proven itself for those needs, as big companies such as Microsoft, Google, or even Facebook use it. If not, don't use it.