
# Solr

Apache Solr is a full-text search server that supports a variety of file formats and is optimized for high traffic. It is highly scalable and fault tolerant, supports both schema and schemaless configurations as well as paginated search and filtering, works with many major languages, and has extensive documentation. These notes install solr-8.3.1 on Windows 10. Java 1.8+ is required.

## Installation

1. Download the latest Apache Solr (Binary release) from the official website.
2. Extract it to a location of your choice.

## Startup

1. Change directory to `solr-8.3.1\`.
2. Start Solr with: `bin\solr.cmd start -e cloud`

:::warning
Error: Could not find or load main class org.apache.solr.util.SolrCLI
> Fix: download the Binary release.
:::

3. Choose the number of Solr nodes in the cluster (pressing enter accepts the default).
```
Welcome to the SolrCloud example!

This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes)
> 2
```
4. Choose the port number for each node (pressing enter accepts the default).
```
Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983]:
Please enter the port for node2 [7574]:
```
5. Solr initializes and starts running on the specified nodes.
```
Creating Solr home directory C:\Users\Rita\Desktop\solr-8.3.1\example\cloud\node1\solr
Cloning C:\Users\Rita\Desktop\solr-8.3.1\example\cloud\node1 into
   C:\Users\Rita\Desktop\solr-8.3.1\example\cloud\node2

Starting up Solr on port 8983 using command:
"C:\Users\Rita\Desktop\solr-8.3.1\bin\solr.cmd" start -cloud -p 8983 -s "C:\Users\Rita\Desktop\solr-8.3.1\example\cloud\node1\solr"

Waiting up to 30 to see Solr running on port 8983

Starting up Solr on port 7574 using command:
"C:\Users\Rita\Desktop\solr-8.3.1\bin\solr.cmd" start -cloud -p 7574 -s "C:\Users\Rita\Desktop\solr-8.3.1\example\cloud\node2\solr" -z localhost:9983

Waiting up to 30 to see Solr running on port 7574
Started Solr server on port 8983. Happy searching!
Started Solr server on port 7574. Happy searching!
INFO - 2019-12-22 16:57:12.358; org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect to ZooKeeper
INFO - 2019-12-22 16:57:12.387; org.apache.solr.common.cloud.ConnectionManager; zkClient has connected
INFO - 2019-12-22 16:57:12.388; org.apache.solr.common.cloud.ConnectionManager; Client is connected to ZooKeeper
INFO - 2019-12-22 16:57:12.407; org.apache.solr.common.cloud.ZkStateReader; Updated live nodes from ZooKeeper... (0) -> (2)
INFO - 2019-12-22 16:57:12.429; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at localhost:9983 ready

Now let's create a new collection for indexing documents in your 2-node cluster.
Please provide a name for your new collection: [gettingstarted]
```

## Creating a Dataset

1. Collection: the previous step ends by asking for the name of the new collection. Following the official example, enter "techproducts".
2. Accept the defaults for the remaining prompts by pressing enter, except for the configuration, where `sample_techproducts_configs` is chosen.
```
How many shards would you like to split techproducts into? [2]
> enter
How many replicas per shard would you like to create? [2]
> enter
Please choose a configuration for the techproducts collection, available options are:
_default or sample_techproducts_configs [_default]
> sample_techproducts_configs

Created collection 'techproducts' with 2 shard(s), 2 replica(s) with config-set 'techproducts'

Enabling auto soft-commits with maxTime 3 secs using the Config API

POSTing request to Config API: http://localhost:8983/solr/techproducts/config
{"set-property":{"updateHandler.autoSoftCommit.maxTime":"3000"}}
Successfully set-property updateHandler.autoSoftCommit.maxTime to 3000

SolrCloud example running, please visit: http://localhost:8983/solr
```
3. Open http://localhost:8983/solr in a browser to see the following interface.

![](https://i.imgur.com/yRluqOA.png)

### Command

The command syntax appears to have changed, so creating the collection through the interactive session above is recommended.
```
bin/solr create -c <yourCollection> -s 2 -rf 2
```

:::warning
(on Windows) `bin\solr.cmd create -c <yourCollection> -s 2 -rf 2 -p 8983`

Use `create` rather than `create_core` here: `create_core` targets standalone mode and does not use the shard and replica settings.
:::

## Schema

### Create: Field

This command uses the Schema API to explicitly define a field named "name" that has the field type "text_general" (a text field).
It will not be permitted to have multiple values, but it will be stored (meaning it can be retrieved by queries). The request targets a collection named `films`, which must already exist.
```
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' http://localhost:8983/solr/films/schema
```
On Windows `cmd.exe`, wrap the JSON in double quotes and escape the inner quotes, since single quotes are not treated as quoting characters there.

### Create: Copy Field

There's one more change to make before we start indexing. In the first exercise, when we queried the documents we had indexed, we didn't have to specify a field to search because the configuration we used was set up to copy fields into a text field, and that field was the default when no other field was defined in the query. The configuration we're using now doesn't have that rule, so we would need to name a field to search in every query. We can, however, set up a "catchall field" by defining a copy field that takes all data from all fields and indexes it into a field named `_text_`.
```
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-copy-field" : {"source":"*","dest":"_text_"}}' http://localhost:8983/solr/films/schema
```

## Adding Data

Solr ships with the bin/post tool, which makes it easy to index **various types** of documents. For the supported file types, see `solr\example\exampledocs\`, which includes csv, json, xml, pdf, html, MS Office, and plain text files. Run the following command to import the data into Solr.
```
java -jar -Dc=techproducts -Dauto example\exampledocs\post.jar example\exampledocs\*
```

### Index Sample Film Data

```
java -jar -Dc=films -Dauto example\exampledocs\post.jar example\films\*.json
```
![](https://i.imgur.com/VLn7eMu.png)

## Searching

Solr can be queried through REST clients, curl, wget, Chrome Postman, and similar tools, as well as through native clients available for many programming languages. The Solr Admin UI also provides a Query screen.

### Search for a Single Term

In the `q` parameter field, enter the term you want to find, such as "foundation". This searches the plain text of the indexed documents, including those that came from html and pdf files.

![](https://i.imgur.com/NA5RfKN.png)

:::info
The `fl` parameter specifies which fields to return. Separate multiple field names with commas.
:::

> Often you want to query across multiple fields at the same time, and this is what we've done so far with the "foundation" query. This is possible with the use of copy fields, which are set up already with this set of configurations.
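
The `q` and `fl` parameters described above can also be sent straight to the collection's `/select` handler instead of going through the Admin UI. The following is a minimal sketch in Python that only builds the request URL. It assumes the `techproducts` collection from the setup above is served on `localhost:8983`; the helper name `build_select_url` and the field list `id`, `name` are illustrative choices, not part of the original notes.

```python
from urllib.parse import urlencode

# Assumption: node1 of the example cluster listens on the default port 8983.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(collection, q, fl=None, base=SOLR_BASE):
    """Build a /select query URL (hypothetical helper).

    `q` is the query string; `fl` is an optional list of field names
    that Solr should return, joined with commas as the fl parameter expects.
    """
    params = {"q": q}
    if fl:
        params["fl"] = ",".join(fl)
    # urlencode percent-encodes the values, so the comma becomes %2C.
    return f"{base}/{collection}/select?{urlencode(params)}"


url = build_select_url("techproducts", "foundation", fl=["id", "name"])
print(url)
```

Opening the printed URL in a browser, or fetching it with curl, should return the matching documents as JSON, provided the cluster is running and the sample data has been indexed.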
### Field Searches

Specify the field name together with the search string, in the form `field:term`.

![](https://i.imgur.com/S0FJgvk.png)

### Phrase Search

To search for a multi-term phrase, enclose it in double quotes, such as "CAS latency".

### Combining Searches

A search can require some terms and exclude others. Prefix a term with `+` to require it, or with `-` to prohibit it.

Find documents that contain both "electronics" and "music":
> +electronics +music

Find documents that contain "electronics" but not "music":
> +electronics -music

### Faceting

One of Solr's most popular features is faceting. Faceting allows the search results to be arranged into subsets (or buckets, or categories), providing a count for each subset. There are several types of faceting: field values, numeric and date ranges, pivots (decision tree), and arbitrary query faceting.

#### Field Facets

In addition to providing search results, a Solr query can return the number of documents that contain each unique value in the whole result set.

#### Range Facets

#### Pivot Facets

## Miscellaneous

### Delete a collection

`bin/solr delete -c <yourCollection>`

### Stop the running Solr nodes

`bin/solr stop -all`

### Remove all nodes

`rm -Rf example/cloud/`

:::info
**field guessing**
Solr attempts to guess what type of data is in a field while it's indexing it. It also automatically creates new fields in the schema for new fields that appear in incoming documents. This mode is called "Schemaless". We'll see the benefits and limitations of this approach to help you decide how and where to use it in your real application.
:::

## Index Your Own Data

Think about what you will need to do for your application:
* What sorts of data do you need to index?
* What will you need to do to prepare Solr for your data (such as creating specific fields, setting up copy fields, determining analysis rules, etc.)?
* What kinds of search options do you want to provide to users?
* How much testing will you need to do to ensure everything works the way you expect?

### Create Your Own Collection

```
./bin/solr create -c localDocs -s 2 -rf 2
```
:warning: This will use the `_default` configset and all the schemaless features it provides.
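
A collection like `localDocs` could also be requested over HTTP through Solr's Collections API rather than the `bin/solr` script. The sketch below only constructs the `CREATE` request URL. It assumes the example cluster from these notes on `localhost:8983`; the helper name `build_create_collection_url` is hypothetical, and the result may not match everything `bin/solr create` does (for example, how the configset is chosen).

```python
from urllib.parse import urlencode

# Assumption: node1 of the example cluster listens on the default port 8983.
SOLR_BASE = "http://localhost:8983/solr"


def build_create_collection_url(name, num_shards=2, replication_factor=2,
                                base=SOLR_BASE):
    """Build a Collections API CREATE request URL (hypothetical helper)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,                  # counterpart of -s 2
        "replicationFactor": replication_factor,  # counterpart of -rf 2
    }
    return f"{base}/admin/collections?{urlencode(params)}"


create_url = build_create_collection_url("localDocs")
print(create_url)
# Sending the request needs the running SolrCloud cluster, e.g.:
#   import urllib.request
#   print(urllib.request.urlopen(create_url).read().decode())
```

Keeping the URL construction separate from the network call makes it easy to inspect the request before sending it to a live cluster.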
### Indexing Ideas

Solr offers the following ways to index data:
* Local Files with bin/post
* DataImportHandler
* SolrJ
* Documents Screen

### Updating Data

Go ahead and edit any of the existing example data files, change some of the data, and re-run the PostTool (bin/post).

### Deleting Data

Execute the following command to delete a specific document:
```
bin/post -c localDocs -d "<delete><id>SP2514N</id></delete>"
```
To delete all documents, you can use the "delete-by-query" command like:
```
bin/post -c localDocs -d "<delete><query>*:*</query></delete>"
```
You can also modify the above to only delete documents that match a specific query.

## Spatial Queries

Solr has sophisticated geospatial support, including searching within a specified distance range of a given location, sorting by distance, and boosting results by distance. An example is a query for "ipod" within 10 km of San Francisco.

## Field

### uninvertible

## Search Workflow

Given Entity1 & Entity2:
1. Retrieve the web pages that contain both E1 and E2.
2. Extract summaries with an unsupervised method: TextRank/DL.
3. Rerank by TF.
4. Take the top N sentences, capped at 500 characters, as the relation summary.

## References

[Official Solr tutorial](https://lucene.apache.org/solr/guide/8_3/solr-tutorial.html)