
High Concurrency Tuning

This document provides best practices for high-concurrency scenarios.

Connection Pooling

Reuse HTTP connections to reduce connection establishment overhead:

import httpx
from openai import OpenAI

# One shared client keeps a pool of up to 100 connections and reuses them across requests
client = OpenAI(
    api_key="sk-real200-xxx",
    base_url="https://real200.com/v1",
    http_client=httpx.Client(limits=httpx.Limits(max_connections=100)),
)

Concurrency Control

Use a semaphore to cap the number of requests in flight at any one time:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="sk-real200-xxx", base_url="https://real200.com/v1")
semaphore = asyncio.Semaphore(20)  # at most 20 requests in flight

async def call_api(prompt):
    # Wait for a free slot; the slot is released when the block exits
    async with semaphore:
        return await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
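As a usage sketch, the semaphore pattern can be wrapped in a small helper that fans a batch of prompts out while keeping the cap in place. The helper name `run_bounded`, its parameters, and the stand-in `fake_worker` are hypothetical; only the standard library is used, so any coroutine function (such as `call_api` without its own semaphore) could be passed as `worker`.

```python
import asyncio


async def run_bounded(worker, prompts, max_concurrent=20):
    """Run worker(prompt) for every prompt, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(prompt):
        async with semaphore:
            return await worker(prompt)

    # gather returns results in input order; with return_exceptions=True a
    # failed call appears as an exception object instead of being raised.
    return await asyncio.gather(
        *(guarded(p) for p in prompts), return_exceptions=True
    )


async def fake_worker(prompt):
    # Stand-in for a real API call, used only to make the sketch runnable.
    await asyncio.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    print(asyncio.run(run_bounded(fake_worker, ["a", "b", "c"], max_concurrent=2)))
```

Creating the semaphore inside the helper ties the limit to one batch; a module-level semaphore, as in the snippet above, instead shares a single limit across all callers.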

Rate Limit Handling

High concurrency makes rate limiting (HTTP 429 responses) more likely. Recommendations:

  • Retry failed requests with exponential backoff
  • Spread requests across different time windows
  • Contact your administrator to raise the rate limit thresholds
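The first two recommendations can be combined in one retry helper: exponential backoff with random jitter both waits longer after each failure and spreads retries out in time. This is a minimal sketch, not a definitive implementation. The names `with_backoff` and `is_rate_limited` and all default values are hypothetical, and the check assumes the raised exception exposes the HTTP status as a `status_code` attribute.

```python
import asyncio
import random


def is_rate_limited(exc):
    # Assumption: the exception carries the HTTP status in `status_code`.
    return getattr(exc, "status_code", None) == 429


async def with_backoff(make_call, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Await make_call(), retrying on 429 with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or the final failure.
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise
            # Delay ceiling doubles each attempt: 1s, 2s, 4s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Sleep a random time below the ceiling so that concurrent
            # callers do not all retry at the same moment.
            await asyncio.sleep(random.uniform(0, ceiling))
```

A call would then look like `await with_backoff(lambda: call_api(prompt))`. The `openai` Python client also accepts a `max_retries` argument for its built-in retries, so it may be worth checking how the two interact before layering both.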