# Dynamic Pinecone Index Selection & Dataset-Level Migration

This document provides a full technical overview of the new functionality that enables platform administrators to assign Pinecone indexes at the **dataset level**, migrate vectors across indexes, and dynamically retrieve indexes from Pinecone.

---

## 1. 🎯 Purpose of the Feature

The platform currently uses a single global Pinecone index for all datasets. This enhancement introduces support for **multiple Pinecone indexes** and allows **each dataset** inside a workspace to be individually assigned to any existing or newly created index.

Key goals:

* Allow platform admins to assign an index per dataset
* Allow creation of new indexes directly from the admin UI
* Migrate a dataset's existing vectors to the selected index
* Preserve the namespace-per-dataset structure
* Fetch all available Pinecone indexes dynamically

This enables improved performance distribution, scalability, and operational flexibility.

---

## 2. 🧩 Current System Behavior

* Only **one global Pinecone index** is used (e.g., `default-index`).
* Every dataset in every workspace is stored as a **namespace** within this index.
* Namespace = `dataset.id`.

Example:

```
default-index
├── namespace: dataset_1
├── namespace: dataset_2
└── namespace: dataset_3
```

All vector operations (upsert, delete, search) operate within this single index.

---

## 3. 🧩 New Required Behavior

Each dataset becomes independently assignable to any Pinecone index.

### When a platform admin selects or creates an index:

1. **Only the selected dataset** is affected.
2. The dataset's existing vector data is **migrated** to the newly selected index.
3. The namespace remains unchanged (the dataset ID).
4. Other datasets in the workspace remain in their current indexes.
5. All future operations for that dataset use the newly assigned index.

Resulting layout example:

```
index_A
└── namespace: dataset_1

index_B
└── namespace: dataset_2

default-index
└── namespace: dataset_3
```

This allows flexible distribution of datasets across indexes.

---

## 4. 🖥️ Admin-Only UI Behavior (Dataset-Level Settings)

A new control appears inside **each dataset's Knowledge Base configuration screen**.

### Admin capabilities:

* **Select an existing Pinecone index** (from a dynamically fetched list)
* **Create a new Pinecone index**
* Apply the selection → triggers dataset vector migration

### Visibility:

* Only visible to **platform administrators**
* Regular workspace users do not see or control index selection

### Scope:

* The selection applies **only to the dataset currently being configured**
* A workspace may have datasets spread across multiple indexes

---

## 5. 🔄 Dataset-Level Migration Flow

When the administrator selects a new index for a dataset, the system performs the following steps:

### 1. Detect the old and new index names

The old index name is stored in the dataset metadata (`index_struct_dict`).

### 2. Fetch all vector IDs from the old index

IDs are listed within the dataset's namespace (the dataset ID).

### 3. Fetch dense vectors, sparse vectors, and metadata in batches

### 4. Upsert the vectors into the new index

The namespace remains:

```
namespace = dataset.id
```

### 5. Delete the vectors from the old index upon successful migration

### 6. Update the dataset metadata

`index_struct_dict['vector_store']['index_name'] = <selected index>`

### 7. All future writes go to the new index

Only one dataset is affected per migration.
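To make the flow above concrete, here is a minimal sketch of steps 2–5 using the Pinecone Python SDK. The function name `migrate_dataset_vectors`, its parameters, and the page size are illustrative assumptions rather than the final implementation; it also assumes a serverless index, where `Index.list()` is available for enumerating vector IDs.

```python
from pinecone import Pinecone


def migrate_dataset_vectors(api_key: str, old_index_name: str, new_index_name: str,
                            namespace: str, page_size: int = 100) -> None:
    """Copy one dataset's vectors to a new index, then remove them from the old index."""
    pc = Pinecone(api_key=api_key)
    old_index = pc.Index(old_index_name)
    new_index = pc.Index(new_index_name)

    # Step 2: list vector IDs page by page within the dataset's namespace
    # (Index.list() is available on serverless indexes).
    for id_page in old_index.list(namespace=namespace, limit=page_size):
        # Step 3: fetch dense values, sparse values, and metadata for this page of IDs.
        fetched = old_index.fetch(ids=list(id_page), namespace=namespace)

        records = []
        for vec_id, vec in fetched.vectors.items():
            record = {"id": vec_id, "values": vec.values, "metadata": vec.metadata or {}}
            if vec.sparse_values is not None:
                record["sparse_values"] = {
                    "indices": vec.sparse_values.indices,
                    "values": vec.sparse_values.values,
                }
            records.append(record)

        # Step 4: upsert into the new index under the same namespace (the dataset ID).
        if records:
            new_index.upsert(vectors=records, namespace=namespace)

    # Step 5: clear the dataset's namespace in the old index only after all pages are copied.
    old_index.delete(delete_all=True, namespace=namespace)
```

In the platform itself this logic would run behind the admin "Apply" action, with the metadata update from step 6 (`index_struct_dict['vector_store']['index_name']`) performed only after the old-index cleanup succeeds.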
---

## 6. 🔌 Dynamic Retrieval of Existing Pinecone Indexes

To enable index selection from the UI, the platform provides a new endpoint.

### **Endpoint**

```
GET /api/v1/vector-store/pinecone/indexes
```

### **Purpose**

* Fetch all existing Pinecone indexes directly from Pinecone
* Provide up-to-date data for the admin index selection UI

### **Backend logic**

```python
from pinecone import Pinecone

pc = Pinecone(api_key=PINECONE_API_KEY)
indexes = pc.list_indexes().names()
```

Optionally, for each index:

```python
pc.describe_index(name)
```

### **Example Response**

```json
{
  "indexes": [
    {
      "name": "default-index",
      "cloud": "aws",
      "region": "us-east-1",
      "dimension": 3072,
      "status": "ready"
    },
    {
      "name": "customer-xyz",
      "cloud": "aws",
      "region": "us-east-1",
      "dimension": 3072,
      "status": "ready"
    }
  ]
}
```

This ensures the UI always displays an accurate, real-time list of indexes.

---

## 7. 🛠 Code-Level Highlights (Required Adjustments)

Below are the targeted updates required to support the feature, with direct pointers to the existing code.

---

## 7.1 `vector_factory.py`

### **a. Store the dataset-specific index_name in metadata**

`index_struct_dict` must include:

```json
{
  "vector_store": {
    "index_name": "<selected_index_name>",
    "class_prefix": "<collection_name>"
  }
}
```

### **b. Load index_name dynamically**

Replace the use of:

```
config.get('PINECONE_INDEX_NAME')
```

with:

```python
index_name = self._dataset.index_struct_dict['vector_store']['index_name']
```

This ensures each dataset initializes Pinecone using its own assigned index.

### **c. Inject it into PineconeConfig dynamically**

```python
config=PineconeConfig(
    api_key=config.get('PINECONE_API_KEY'),
    cloud=config.get('PINECONE_CLOUD'),
    region=config.get('PINECONE_REGION'),
    index_name=index_name,
    dimension=int(config.get('PINECONE_DIMENSIONS')),
    batch_size=int(config.get('PINECONE_BATCH_SIZE'))
)
```

No other modifications are required.

---

## 7.2 `pinecone_vector.py`

### **a. Ensure dynamic index usage**

Since `PineconeConfig.index_name` is now dataset-specific, all operations (upsert, delete, query) are already aligned with the assigned index.

### **b. Correct `get_collection_name()` behavior**

Currently:

```python
index_name = dataset.index_struct_dict['vector_store']['index_name']
return index_name
```

The method should instead continue to derive the collection name from the dataset ID, so the namespace stays tied to the dataset:

```python
return Dataset.gen_collection_name_by_id(dataset.id)
```

### **c. Add a dataset migration helper**

A new internal method will handle the migration:

```python
def migrate_to_new_index(self, new_index_name):
    # fetch from old index → upsert into new index → delete old
    pass
```

This is triggered by the admin action; the sketch at the end of Section 5 illustrates the underlying steps.

### **d. Namespace handling remains unchanged**

All upserts and searches correctly use:

```python
namespace=self._dataset_id
```

which aligns with the dataset-level separation.

---

## 8. 📚 Updated System Behavior Summary

After implementing this feature:

* Each dataset can reside in a different Pinecone index
* Admins can dynamically assign or create indexes
* Only the selected dataset's vectors are migrated
* The namespace structure remains unchanged
* Search, hybrid search, upsert, and delete all operate through the dataset's assigned index
* Index lists are fetched dynamically from Pinecone via the API

This enables horizontal scaling, improved index distribution, and operational flexibility without altering the user workflow.

---
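For reference, a minimal sketch of the index-listing endpoint from Section 6, assuming a Flask application and the `pinecone` Python SDK. The handler name, how the API key is loaded, and the omitted admin-only permission check are illustrative assumptions; the serverless spec fields are read defensively because pod-based indexes expose a different spec shape.

```python
import os

from flask import Flask, jsonify
from pinecone import Pinecone

app = Flask(__name__)
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])  # assumed: key loaded from env/config


@app.get("/api/v1/vector-store/pinecone/indexes")
def list_pinecone_indexes():
    """Return all Pinecone indexes with the details the admin UI needs."""
    # NOTE: an admin-only permission check belongs here; omitted in this sketch.
    indexes = []
    for name in pc.list_indexes().names():
        desc = pc.describe_index(name)
        serverless = getattr(desc.spec, "serverless", None)  # None for pod-based indexes
        indexes.append({
            "name": desc.name,
            "cloud": serverless.cloud if serverless else None,
            "region": serverless.region if serverless else None,
            "dimension": desc.dimension,
            "status": "ready" if desc.status.ready else desc.status.state,
        })
    return jsonify({"indexes": indexes})
```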