GoAgent Source Deep Dive 09: Embedding Service — The Engineering Boundary of Vector Generation

The Problem: How Do You Turn Text Into Vectors

GoAgent's retrieval depends on vector similarity search. PostgreSQL + pgvector can do vector search, but you need vectors first. How does text become a vector?

Limitations of Existing Approaches

CGO bindings for vector models in Go: complex compilation, cross-platform difficulties, and an immature Go ML ecosystem.
Python subprocess from Go: high latency and complex process management.
External API (OpenAI Embedding): an external dependency with network latency and cost, not suitable for local deployment.

All three have engineering pain points. What GoAgent needs: vector generation decoupled from the Go main program, supporting local deployment and flexible backend switching.

GoAgent's Approach

Deploy embedding as an independent HTTP service:

Python FastAPI implementation, leveraging Python's ML ecosystem.
Go calls via HTTP, no CGO or subprocess dependency.
Supports Ollama and SentenceTransformers backends, switchable via environment variable.
Optional Redis cache, L2 normalization for retrieval consistency.

flowchart LR
    subgraph "Go Main Program"
        Retrieval[RetrievalService]
        Client[EmbeddingClient]
    end
    subgraph "Embedding Service"
        FastAPI[FastAPI]
        Cache[Redis]
    end
    subgraph "Backends"
        Ollama[Ollama]
        ST[SentenceTransformers]
    end
    Retrieval --> Client
    Client -->|HTTP| FastAPI
    FastAPI --> Cache
    FastAPI --> Ollama
    FastAPI --> ST

Architecture Naturally Emerges

Go Side: EmbeddingClient

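import "net/http"

// EmbeddingClient calls the embedding service over HTTP.
type EmbeddingClient struct {
    baseURL    string       // root URL of the embedding service
    httpClient *http.Client // HTTP client used for every request
    retries    int          // retry count for failed requests
}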

Two core methods: Embed (single text) and EmbedBatch (batch). Each builds a JSON body, POSTs it to /embed or /embed_batch, and parses the response.

Python Side: FastAPI Service

Three endpoints: /embed, /embed_batch, /health.

Single text flow: check cache → generate vector → L2 normalize → write cache → return.
Batch flow: filter cached → batch generate uncached → normalize and cache each → return.

Backend Switching

The EMBEDDING_BACKEND environment variable selects Ollama or SentenceTransformers, with no code changes. Development uses sentence-transformers (zero config); production uses Ollama (high performance).
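The selection logic amounts to a small factory keyed on one environment variable. The sketch below shows that pattern in Go for consistency with the rest of the article; the real service does this in Python. The Backend interface, the accepted values ("ollama", "sentence-transformers"), and the default when the variable is unset are assumptions.

```go
package main

import (
	"fmt"
	"os"
)

// Backend is a hypothetical interface both model backends would satisfy.
// A real backend would also expose a method that produces vectors.
type Backend interface {
	Name() string
}

type ollamaBackend struct{}
type sentenceTransformersBackend struct{}

func (ollamaBackend) Name() string               { return "ollama" }
func (sentenceTransformersBackend) Name() string { return "sentence-transformers" }

// selectBackend maps the EMBEDDING_BACKEND value to a backend.
// Treating an unset variable as sentence-transformers is an assumption,
// chosen to match the zero-config development setup.
func selectBackend(value string) (Backend, error) {
	switch value {
	case "ollama":
		return ollamaBackend{}, nil
	case "", "sentence-transformers":
		return sentenceTransformersBackend{}, nil
	default:
		return nil, fmt.Errorf("unknown EMBEDDING_BACKEND %q", value)
	}
}

func main() {
	backend, err := selectBackend(os.Getenv("EMBEDDING_BACKEND"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("using backend:", backend.Name())
}
```

Failing fast on an unknown value surfaces a misconfigured deployment at startup instead of on the first request.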

Vector Normalization

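All vectors are L2-normalized before being returned (norm = 1). After normalization, cosine similarity equals the dot product, so similarity can be computed with a single dot product and no norm computation. Without normalization, vectors for long texts may have larger magnitudes, which biases magnitude-sensitive scores such as the raw dot product.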

Cache

Redis is optional. With it, repeated text is served from the cache instead of being recomputed, which reduces load on the backend. Without it, the service still works, just without caching.

Design Trade-offs

Independent service vs embedded: costs an extra network hop and another deployment component, but buys language decoupling, independent scaling, and flexible backends.
HTTP vs gRPC: HTTP is easy to debug and language-agnostic. For simple request-response calls, its overhead is acceptable.
Optional vs mandatory cache: zero config for development, enabled on demand in production. Service availability does not depend on Redis.

Summary

The Embedding Service is the prerequisite for GoAgent's retrieval. Deploying it independently trades some engineering complexity for operational flexibility. The architecture emerged naturally from the question "how does text become vectors": the Python ecosystem for vector generation, Go for business logic, and HTTP as the bridge.