---
title: 'OpenLLMetry'
summary: "OpenLLMetry is developed and maintained by the Traceloop team under the Apache 2.0 license. It provides specialized monitoring and debugging tools for LLM applications by extending OpenTelemetry capabilities. It leverages OpenTelemetry's standardized telemetry data format to standardize the output of key performance metrics and trace information from LLM applications."
tags:
  - 'OTEL'
  - 'APM'
__int_icon: 'icon/openllmetry'
dashboard:
  - desc: 'OpenLLMetry'
    path: 'dashboard/zh/openllmetry'
---

OpenLLMetry
OpenLLMetry is developed and maintained by the Traceloop team under the Apache 2.0 license. It provides specialized monitoring and debugging tools for LLM applications by extending OpenTelemetry capabilities. It leverages OpenTelemetry's standardized telemetry data format to standardize the output of key performance metrics and trace information from LLM applications.
Configuration
Before sending trace data to DataKit via OTEL, make sure the Collector has been configured. In addition, adjust the customer_tags setting in the configuration file as follows:
[[inputs.opentelemetry]]
  ## customer_tags acts as a whitelist: only the tags listed here are sent to the data center.
  ## Every "." in a tag key is replaced with "_", for example
  ## "project.name" is reported to the Guance center as "project_name".
  customer_tags = ["llm.request.type","traceloop.entity.path","llm.is_streaming","gen_ai.openai.api_base","gen_ai.prompt.1.content","gen_ai.response.model","gen_ai.completion.0.content","gen_ai.request.model","gen_ai.request.temperature","gen_ai.system","traceloop.workflow.name"]
...
After completing the adjustments, restart DataKit for the configuration to take effect.
Install OpenTelemetry SDK
pip install opentelemetry-api opentelemetry-instrumentation
pip install opentelemetry-instrumentation-flask
Install OpenLLMetry SDK
pip install traceloop-sdk
Initialize OpenLLMetry in your application
from traceloop.sdk import Traceloop
# Initialize OpenLLMetry
# Traceloop.init()
Traceloop.init(app_name="kimi_openllmetry_stream_flask")
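When verifying locally, it can help to export each span as soon as it ends rather than batching them. Below is a minimal sketch using the disable_batch option from the traceloop-sdk quickstart; the app_name value is simply the one used in the sample further down.

```python
from traceloop.sdk import Traceloop

# disable_batch=True flushes each span immediately, which makes it easier
# to confirm that data is reaching DataKit while debugging. Omit it in production.
Traceloop.init(app_name="kimi_openllmetry_stream_flask", disable_batch=True)
```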
OpenLLMetry Sample Code
import os
import httpx
from flask import Flask, request, Response, jsonify, stream_with_context
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task
from openai import OpenAI
from opentelemetry.instrumentation.flask import FlaskInstrumentor
app = Flask(__name__)
# Automatically instrument the Flask app using FlaskInstrumentor
FlaskInstrumentor().instrument_app(app)
# Initialize OpenLLMetry
Traceloop.init(app_name="kimi_openllmetry_stream_flask")
# Retrieving API Key from environment variable
api_key = os.getenv("MOONSHOT_API_KEY")
if not api_key:
raise ValueError("Please set the MOONSHOT_API_KEY environment variable")
client = OpenAI(
api_key=api_key,
base_url="https://api.moonshot.cn/v1",
)
def estimate_token_count(input_messages) -> int:
"""
Calculate the number of Tokens in the input messages.
"""
try:
header = {
"Authorization": f"Bearer {api_key}",
}
data = {
"model": "moonshot-v1-128k",
"messages": input_messages,
}
with httpx.Client() as client:
print("Calling API endpoint")
r = client.post("https://api.moonshot.cn/v1/tokenizers/estimate-token-count", headers=header, json=data)
r.raise_for_status()
response_data = r.json()
print(response_data["data"]["total_tokens"])
return response_data["data"]["total_tokens"]
except httpx.RequestError as e:
print(f"Request failed: {e}")
raise
except (KeyError, ValueError) as e:
print(f"Parsing response failed: {e}")
raise
def select_model(input_messages, max_tokens=1024) -> str:
"""
Select an appropriate model based on the provided context messages and the expected max_tokens value.
"""
if not isinstance(max_tokens, int) or max_tokens <= 0:
raise ValueError("max_tokens must be a positive integer")
prompt_tokens = estimate_token_count(input_messages)
total_tokens = prompt_tokens + max_tokens
if total_tokens <= 8 * 1024:
return "moonshot-v1-8k"
elif total_tokens <= 32 * 1024:
return "moonshot-v1-32k"
elif total_tokens <= 128 * 1024:
return "moonshot-v1-128k"
else:
raise ValueError("Token count exceeds limit 😢")
@app.route('/ask', methods=['POST'])
@workflow(name="ask_workflow")
def ask():
data = request.json
messages = data.get('messages')
max_tokens = data.get('max_tokens', 2048)
if not messages:
return jsonify({"error": "The messages field cannot be empty"}), 400
try:
model = select_model(messages, max_tokens)
completion = client.chat.completions.create(
model=model,
messages=messages,
max_tokens=max_tokens,
temperature=0.3,
stream=True # Enable streaming generation
)
def generate():
for chunk in completion:
# yield chunk.choices[0].delta.content or ''
delta = chunk.choices[0].delta
if delta.content:
print(delta.content, end="")
yield delta.content or ''
return Response(stream_with_context(generate()), content_type='text/event-stream')
except Exception as e:
return jsonify({"error": str(e)}), 500
if __name__ == '__main__':
app.run(debug=True,port=5001)
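Once the Flask app above is running (port 5001, as in app.run), you can exercise the /ask endpoint and watch the streamed reply. The following is a minimal client sketch using httpx, which the sample already depends on; the host, port, and message content are only illustrative.

```python
import httpx

# Hypothetical test request; adjust the URL if the sample runs elsewhere.
payload = {
    "messages": [{"role": "user", "content": "Hello, please introduce yourself."}],
    "max_tokens": 256,
}

# Stream the text/event-stream response chunk by chunk as the endpoint yields it.
with httpx.Client(timeout=60) as client:
    with client.stream("POST", "http://127.0.0.1:5001/ask", json=payload) as r:
        r.raise_for_status()
        for text in r.iter_text():
            print(text, end="", flush=True)
```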
Configure the environment variable so that data is reported to DataKit via OpenTelemetry:
export TRACELOOP_BASE_URL=http://localhost:9529/otel
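If you prefer not to rely on the shell environment, the same endpoint can be set in code before Traceloop.init() runs. The sketch below assumes the SDK reads TRACELOOP_BASE_URL at initialization time and that DataKit listens locally on its default port 9529.

```python
import os

# Point the OpenLLMetry exporter at the local DataKit OTEL endpoint
# before the SDK is initialized; an existing value is left untouched.
os.environ.setdefault("TRACELOOP_BASE_URL", "http://localhost:9529/otel")

from traceloop.sdk import Traceloop

Traceloop.init(app_name="kimi_openllmetry_stream_flask")
```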
Metric Details

| Metric Name | Description | Unit |
| --- | --- | --- |
| gen_ai.client.generation.choices | Number of choices generated by the client | count |
| gen_ai.client.operation.duration_bucket | Histogram bucket for duration of client operations | ms |
| gen_ai.client.operation.duration_count | Total number of client operations | times |
| gen_ai.client.operation.duration_max | Maximum duration of client operations | ms |
| gen_ai.client.operation.duration_min | Minimum duration of client operations | ms |
| gen_ai.client.operation.duration_sum | Total duration of client operations | ms |
| llm.openai.chat_completions.streaming_time_to_first_token_bucket | Histogram bucket for time taken to generate the first token during streaming chat completions in OpenAI | ms |
| llm.openai.chat_completions.streaming_time_to_first_token_count | Total number of times the first token was generated during streaming chat completions in OpenAI | times |
| llm.openai.chat_completions.streaming_time_to_first_token_max | Maximum time taken to generate the first token during streaming chat completions in OpenAI | ms |
| llm.openai.chat_completions.streaming_time_to_first_token_min | Minimum time taken to generate the first token during streaming chat completions in OpenAI | ms |
| llm.openai.chat_completions.streaming_time_to_first_token_sum | Total time taken to generate the first token during streaming chat completions in OpenAI | ms |
| llm.openai.chat_completions.streaming_time_to_generate_bucket | Histogram bucket for total time taken to generate content during streaming chat completions in OpenAI | ms |
| llm.openai.chat_completions.streaming_time_to_generate_count | Total number of times content was generated during streaming chat completions in OpenAI | times |
| llm.openai.chat_completions.streaming_time_to_generate_max | Maximum time taken to generate content during streaming chat completions in OpenAI | ms |
| llm.openai.chat_completions.streaming_time_to_generate_min | Minimum time taken to generate content during streaming chat completions in OpenAI | ms |
| llm.openai.chat_completions.streaming_time_to_generate_sum | Total time taken to generate content during streaming chat completions in OpenAI | ms |
References
- OpenLLMetry quickstart
- OpenLLMetry otel-collector
- OpenLLMetry GitHub