# Streaming Responses with Gaia Nodes

Gaia nodes provide streaming capabilities similar to OpenAI's APIs. By default, when you request a completion from a Gaia node, the entire completion is generated before being sent back in a single response. If you're generating long completions, waiting for the response can take many seconds.

To get responses sooner, you can 'stream' the completion as it's being generated. This allows you to start printing or processing the beginning of the completion before the full completion is finished.

To stream completions, set `stream=True` when calling the chat completions endpoint. This returns an object that streams back the response as data-only server-sent events. Extract content from the `delta` field rather than the `message` field.

## Prerequisites

The examples below use the OpenAI Python client pointed at a Gaia node's OpenAI-compatible endpoint. Replace the placeholder base URL with your own node's address.

```python
import time
from openai import OpenAI

# Placeholder endpoint: replace with your Gaia node's OpenAI-compatible base URL.
# If your node does not require authentication, any non-empty api_key value works.
client = OpenAI(base_url='https://YOUR-NODE-URL/v1', api_key='gaia')
```

## 1. What a typical chat completion response looks like

With a typical ChatCompletions API call, the response is first computed in full and then returned all at once.

```python
# record the time before the request is sent
start_time = time.time()

# send a ChatCompletion request to count to 100
response = client.chat.completions.create(
    model='llama',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    temperature=0
)

# the full response only arrives after generation has finished
print(f"Full response received {time.time() - start_time:.2f} seconds after request")
```

The reply can be extracted with `response.choices[0].message`. The content of the reply can be extracted with `response.choices[0].message.content`.

## 2. How to stream a chat completion

With a streaming API call, the response is sent back incrementally in chunks via an event stream. In Python, you can iterate over these events with a `for` loop.

```python
response = client.chat.completions.create(
    model='llama',
    messages=[
        {'role': 'user', 'content': "What's 1+1? Answer in one word."}
    ],
    temperature=0,
    stream=True  # this time, we set stream=True
)

for chunk in response:
    print(chunk)
    print(chunk.choices[0].delta.content)
    print("****************")
```

As you can see above, streaming responses have a `delta` field rather than a `message` field. The delta can contain:

- A role token (e.g., `{"role": "assistant"}`)
- A content token (e.g., `{"content": "text"}`)
- Nothing, when the stream is over

## 3. How much time is saved by streaming a chat completion

Let's look at how quickly we receive content with streaming:

```python
# record the time before the request is sent
start_time = time.time()

# send a ChatCompletion request to count to 100, with streaming enabled
response = client.chat.completions.create(
    model='llama',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    temperature=0,
    stream=True
)

collected_chunks = []    # raw chunk objects
collected_messages = []  # content tokens only

for chunk in response:
    chunk_time = time.time() - start_time
    collected_chunks.append(chunk)
    # delta.content is None for the role chunk and the final chunk, so fall back to ""
    chunk_message = chunk.choices[0].delta.content or ""
    collected_messages.append(chunk_message)
    print(f"Message received {chunk_time:.2f} seconds after request: {chunk_message}")

# reassemble the streamed tokens into the full reply
print(f"Full response received {time.time() - start_time:.2f} seconds after request")
full_reply_content = ''.join(collected_messages)
print(f"Full conversation received: {full_reply_content}")
```

**With streaming:**

- First token arrives quickly (often <0.5s)
- Subsequent tokens arrive every ~0.01-0.02s
- User sees partial responses immediately

**Without streaming:**

- Must wait for the full response (often several seconds)
- No intermediate feedback

**Choose streaming when you want to:**

- Show partial results immediately
- Provide a responsive user experience
- Handle long responses gracefully

#### Credits

Inspired by [this example](https://cookbook.openai.com/examples/how_to_stream_completions).
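
Putting the pieces together, here is a minimal sketch of how streamed tokens might be displayed as they arrive and then reassembled into the full reply. It assumes the same `client`, model name, and Gaia node setup as the examples above; the prompt is just an illustration.

```python
# Minimal streaming sketch: print tokens as they arrive, then keep the assembled reply.
# Assumes `client` is the OpenAI client configured for your Gaia node (see Prerequisites).
stream = client.chat.completions.create(
    model='llama',
    messages=[{'role': 'user', 'content': 'Write a one-sentence greeting.'}],
    temperature=0,
    stream=True
)

parts = []
for chunk in stream:
    token = chunk.choices[0].delta.content or ""  # None for the role and final chunks
    parts.append(token)
    print(token, end="", flush=True)  # show partial output immediately
print()  # newline once the stream ends

full_reply = "".join(parts)
```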