Streaming ChatGPT Generations
Introduction
- ChatGPT is an auto-regressive Large Language Model. That means its output is generated token by token in a sequential fashion, where each token can be a combination of characters. Under normal circumstances (and popular coding practice), we access the ChatGPT model via an API that takes an input prompt, generates the output and then returns it. While this may sound okay, there is one problem: the model returns the output only when the complete generation is done! This means that if you want the model to generate long outputs (or even if your prompt is big due to few-shot examples or lengthy system prompts), you can expect a delay of several seconds before you receive anything at all.
- This is not okay for user-facing applications where your users are patiently waiting for the output. That's why the ChatGPT UI returns the output in a streaming fashion: you see characters or words printed on your screen one after another, rather than the complete output appearing at once. This creates the perception of the model writing the output the way a human would, and even though the time to generate the complete output is the same, the wait feels much more bearable.
- Behind the scenes, the ChatGPT API uses Server-Sent Events (SSE), i.e. an event-stream media type, to return each token as and when it is generated. SSE sits between a normal HTTP request (the server returns one result per request) and a WebSocket (the server and the client can both send multiple messages). In SSE, the client sends one request and the server can return multiple results over the same connection.
- In this article, we will try to replicate the ChatGPT streaming behavior by creating a Python application (a FastAPI server) that acts as a middleman between OpenAI and the frontend. OpenAI returns the output to our Python server at token level, and the server passes it along to its clients in the same manner. Let's get started!
Code
OpenAI Streaming
- If you know how to use the OpenAI package to generate ChatGPT output (which in itself is quite easy), then getting the streaming output is nothing but adding one more param (stream=True) in openai.ChatCompletion.create. Below is the code that you can easily copy-paste and start using right now,
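The sketch below assumes the pre-1.0 openai Python package (which exposes openai.ChatCompletion.create) and gpt-3.5-turbo as the model; the prompt text and the API key placeholder are only examples, so swap in your own.

```python
import time

import openai

openai.api_key = "sk-..."  # replace with your own OpenAI API key

start = time.time()

# stream=True turns the call into a generator that yields chunks as tokens are produced
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a short poem about the sea."}],
    stream=True,
)

for chunk in completion:
    # each chunk carries a delta that may contain the next piece of the content
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        # print without a newline and flush immediately to simulate typing
        print(delta["content"], end="", flush=True)

print(f"\nTotal time: {time.time() - start:.2f}s")
```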
- In the above code, just by adding stream=True, the OpenAI package takes care of all the hard work and gives us a completion generator. In Python, you just need to iterate over it to get the result at token level, as and when the tokens are available. For example, if you time the call with and without the stream=True param, you will notice the difference both in the output and in the time: without the param the output only becomes available after a couple of seconds, while with the param the first token arrives within a second and each subsequent token follows the previous one with a short gap.
- To simulate the streaming output, we use the print statement with end="" instead of the default end="\n" so that no newline is added after each iteration. We also pass flush=True so that print does not wait for its buffer to fill up before writing to the terminal.
OpenAI Streaming App (using FastAPI)
- Now that we have the OpenAI related part done, let's focus on creating the FastAPI app and exposing the OpenAI wrapper as an event-stream service. Below is the code,
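The sketch below (api.py) assumes a single POST endpoint at /chat that wraps openai.ChatCompletion.create and streams every token back as a Server-Sent Event via FastAPI's StreamingResponse; the request schema and the data: payload format are illustrative choices rather than the only way to do it.

```python
# api.py
import json

import openai
import uvicorn
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

openai.api_key = "sk-..."  # replace with your own OpenAI API key

app = FastAPI()


class ChatRequest(BaseModel):
    # prompt to forward to ChatGPT; the model name is kept configurable
    prompt: str
    model: str = "gpt-3.5-turbo"


def openai_token_stream(request: ChatRequest):
    """Call OpenAI with stream=True and yield each token as an SSE 'data:' frame."""
    completion = openai.ChatCompletion.create(
        model=request.model,
        messages=[{"role": "user", "content": request.prompt}],
        stream=True,
    )
    for chunk in completion:
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            # an SSE frame is "data: <payload>" followed by a blank line
            yield f"data: {json.dumps({'token': delta['content']})}\n\n"
    # let the client know the generation is complete
    yield "data: [DONE]\n\n"


@app.post("/chat")
def chat(request: ChatRequest):
    # text/event-stream tells the client to treat the response as Server-Sent Events
    return StreamingResponse(openai_token_stream(request), media_type="text/event-stream")


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

StreamingResponse accepts any generator and flushes each yielded chunk to the client as soon as it is produced, which is exactly what we need to forward OpenAI's tokens without buffering the whole answer.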
- To run the code, you just need to hit python api.py, and our endpoint is available at http://0.0.0.0:8000/chat!
Client for FastAPI OpenAI Streaming App
- Once you have the server running, let's see how we can hit it from any other service. Here I am showing an example of a Python client.
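The sketch below uses the requests library and assumes the data: {"token": ...} / [DONE] framing produced by the server sketch above; adjust the parsing if your server emits a different payload.

```python
# client.py
import json

import requests

url = "http://0.0.0.0:8000/chat"
payload = {"prompt": "Write a short poem about the sea."}

# stream=True keeps the connection open so we can read SSE frames as they arrive
with requests.post(url, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        # blank lines separate SSE frames; skip anything that is not a data frame
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        # print each token as it arrives, just like the raw OpenAI example
        print(json.loads(data)["token"], end="", flush=True)
print()
```

Run python client.py while api.py is up, and you should see the generation appear word by word in your terminal, just like in the ChatGPT UI.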