Document＆Index(Re,Shrink)

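# Document＆Index(Re,Shrink)

## Document

* The smallest unit of data; every document is stored in an index
* A document holds the data, and every document has an `_id`
* Think of a document as one record (row) in a database
* A document is made up of fields

## Index

* Comparable to a table in a database
* Each index holds multiple documents (the data)

---

| cluster | node | index | shard | document | field |
| -------- | -------- | -------- | -------- | -------- | -------- |
| 叢集 | 節點 | 索引 | 分片 | 文件 | 字段 |

* A deployment generally has 1 cluster that is responsible for running Elasticsearch as a whole
* The number of nodes depends on the user's data volume and on the hardware; the baseline is 3 nodes
* Indices and shards are among the factors that affect the health of a node!!
    * Suppose 1 index is created per day, split into 3 primary shards, with replica = 1
    * That adds 6 shards per day (3 primaries, 3 replicas), so a node reaches its shard limit after about 166 days (the default `cluster.max_shards_per_node` is 1000)
    * And that is with only one index per day, let alone n indices
* This is why the next sections introduce Reindex & Shrink index

---

## Reindex

* Reindexing is usually considered in one of these situations
    1. Too many indices have left the cluster short of shards
    2. A single index holds too little data and ties up shards for little benefit
    3. An index has grown very large and its shards need to be redistributed
* Reindex "copies" the documents in old_index into a new new_index
    * Note that old_index still exists after the reindex
* The settings of new_index (shards & mapping) must be configured beforehand
    * Make sure they match the rules of the actual use case
    * `By default, ES creates new_index with 1 primary shard and 1 replica shard`
* For important data, take a snapshot backup first and then reindex

---

## Problems you may run into with Reindex

* By default reindex runs single-threaded and is slow; one remedy is to raise the throughput of each batch
    * Adjust the batch size in `source`

```bash=
POST _reindex
{
  "source": {
    "index": "source",
    "size": 5000 # batch size: number of documents processed per batch (default 1000), not a size in bytes
  },
  "dest": {
    "index": "dest"
  }
}
```

* Use Sliced Scroll to run several slices concurrently [official docs](https://www.elastic.co/guide/en/elasticsearch/client/curator/current/option_slices.html)
    * Parallel processing improves performance

```bash=
# Run 5 slices in parallel
POST _reindex?slices=5&refresh
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  }
}
```

* If the source is remote (ES is not on the same host), e.g. moving data from host A to host B

```bash=
POST _reindex
{
  "source": {
    "remote": { # remote source
      "host": "http://otherhost:9200", # remote host A; you can test it with curl first
      "username": "user", # username and password
      "password": "pass"
    },
    "index": "my-index-000001", # the index on host A to move
    "query": {
      "match": { # match specific documents, or select all of them
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "my-new-index-000001" # name of the new index on host B
  }
}
```

* In addition, two settings must be added to the ES yml file (`elasticsearch.yml`) on host B; the whitelist entry has to match the host and port of the remote source

```bash=
reindex.remote.whitelist: "10.20.30.156:19200" # host:port allowed as a remote source
reindex.ssl.verification_mode: none # disable certificate verification
```

---

## shrink index

* Shrink index is mostly about controlling the number of primary shards
    1. It is typically used to reduce the count when there are too many primary shards
    2. A copy of every shard must be located on the same node before the operation can run
    3. The shards must be in the STARTED state; running it while cluster health is green is recommended
    4. The index must be read-only

---

* Start by querying the shard information

```
GET _cat/shards/source_index?v
```

* Next, make the index read-only and relocate its shards

```bash=
PUT source_index/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "node-1", # force the shards onto node-1
    "index.blocks.write": true # set the index to read-only
  }
}
# After running this, wait for the shard relocation to finish
```

* Then run the shrink

```bash=
POST source_index/_shrink/shrunken_index
{
  "settings": {
    "index.number_of_shards": 1, # number of primary shards
    "index.number_of_replicas": 1 # number of replica shards
  }
}
```

* Verify

```
GET _cat/indices/shrunken_index?v
```

```
GET shrunken_index/_count
GET source_index/_count
```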
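
The comparison of the two `_count` results above can also be scripted. The following is a minimal Python sketch and not part of the original notes: the helper name `counts_match` is hypothetical, and it assumes you already hold the parsed JSON bodies of the two `_count` responses, each of which carries a top-level `count` field. The sample bodies are made up for illustration.

```python
def counts_match(source_resp: dict, target_resp: dict) -> bool:
    """Return True when both _count responses report the same document count.

    Hypothetical helper: each argument is the parsed JSON body of a
    _count response, for example {"count": 1200, "_shards": {...}}.
    """
    return source_resp["count"] == target_resp["count"]


if __name__ == "__main__":
    # Made-up response bodies; real ones come from the cluster.
    source = {"count": 1200, "_shards": {"total": 3, "successful": 3, "failed": 0}}
    shrunken = {"count": 1200, "_shards": {"total": 1, "successful": 1, "failed": 0}}
    print("document counts match:", counts_match(source, shrunken))
```

If the counts differ, it is probably worth keeping source_index around until the cause is understood.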
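
Returning to the shard arithmetic from the overview (3 primaries, replica = 1, one index per day, a limit of 1000 shards), the estimate can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions that the notes do not spell out: the function name `days_until_shard_limit` is hypothetical, all shards count against a single budget of 1000, and no index is ever deleted or shrunk.

```python
def days_until_shard_limit(primaries: int, replicas: int,
                           indices_per_day: int = 1,
                           shard_limit: int = 1000) -> int:
    """Whole days of daily index creation that fit under the shard limit.

    Assumption: every new index adds primaries * (1 + replicas) shards
    and nothing is ever removed.
    """
    shards_per_index = primaries * (1 + replicas)
    shards_per_day = shards_per_index * indices_per_day
    return shard_limit // shards_per_day


if __name__ == "__main__":
    # 3 primaries + 3 replicas = 6 shards per day, and 1000 // 6 = 166
    print(days_until_shard_limit(primaries=3, replicas=1))
```

Raising `indices_per_day` shows how quickly the budget shrinks once more than one index is created per day.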