Enterprise · Technical Reference
Deployment reference for the EthiCompass Enterprise platform. On-premise, private cloud, and hybrid configurations for regulated enterprises. Structured sizing for every scenario, from minimum viable deployment to high-concurrency production.
This page is a technical reference. For commercial discussion, implementation timelines, and engagement scope, contact our team.
Deployment Modes
Enterprise supports three deployment modes. Each preserves the immutable audit trail, the 7-dimension evaluation framework, and the EU AI Act Articles 9–15 mapping. The difference is where the inference workload runs and what infrastructure footprint you accept.
On-Premise
Deployed inside your data center. All inference runs locally on dedicated GPUs. No outbound calls. Suitable for the strictest data residency and regulatory requirements, including sectors with absolute on-premise mandates.
Private Cloud
Deployed in your AWS, Azure, or GCP tenancy. Inference runs on managed GPU instances within your network boundary. Data remains in your cloud account. Suitable for enterprises with approved cloud regions and residency controls.
Local Embeddings + Cloud Inference
Embeddings run locally on a small GPU. The two larger models are served through a regulated inference provider over private endpoints. Lower hardware footprint, faster time to deployment. Suitable for pilot deployments or environments with limited local GPU capacity.
Deployment mode is a decision made during the Architecture Assessment phase of implementation. All modes produce the same compliance evidence and the same immutable audit trail.
Reference Architecture
The Enterprise platform is composed of infrastructure services and evaluation models. The infrastructure services manage data storage, versioning, messaging, and the application layer. The evaluation models produce the quantitative scores behind the 7-dimension framework.
| Component | Role |
|---|---|
| MinIO | S3-compatible object storage for datasets and evidence artifacts |
| LakeFS | Versioned dataset management with full lineage |
| Apache Kafka + Zookeeper | Event streaming for evaluation pipelines and audit events |
| API Orchestrator (FastAPI) | Coordinates evaluation requests across dimensions |
| Metrics Service (FastAPI) | Aggregates scores and produces the compliance scorecards |
| Model | Role in the 7-Dimension Framework | Served Via |
|---|---|---|
| Qwen3-Embedding (600M parameters) | Embeddings for Toxicity and Regulatory Compliance dimensions | vLLM (embedding task) |
| Gemma 4 E4B (4.7B parameters) | Judge model for the Context dimension (LLM-as-Judge) | vLLM |
| Llama Guard 4 (12B parameters) | Guardian model for the Bias dimension | vLLM |
All models are served through vLLM with paged attention, continuous batching, and efficient KV cache management. The embedding model is shared across two dimensions and runs as a single instance.
Configuration Profiles
Three reference configurations. Each is validated for a specific deployment posture. Select the profile that matches your evaluation concurrency, residency requirements, and regulatory commitments.
| Resource | Minimum (INT4, sequential) | Recommended (INT4, parallel) | Production (bf16, high concurrency) |
|---|---|---|---|
| GPU | 1 × 16–24 GB VRAM (e.g., RTX 3090, RTX 4080, RTX 4090) | 1 × 24 GB VRAM (e.g., RTX 4090, A5000) | 1 × 80 GB VRAM (A100) or 2 × 48 GB (A6000) |
| System RAM | 32 GB DDR4/DDR5 | 64 GB DDR5 | 128 GB DDR5 ECC |
| CPU | 14 cores | 20 cores | 32 cores |
| Storage | 256 GB NVMe SSD | 512 GB NVMe SSD | 1 TB NVMe SSD |
| Network | 1 Gbps | 1 Gbps | 10 Gbps |
| Use case | Proof of concept, low-volume evaluation | Standard enterprise deployment | Multi-tenant, high-concurrency, real-time monitoring |
All profiles assume a single-node deployment. Multi-node clustering is available for production environments with failover requirements. Discuss during Architecture Assessment.
GPU & Inference Sizing
The three evaluation models have materially different memory profiles. The deployment can run in full precision, INT8, or INT4. Each scenario is a tradeoff between VRAM footprint, throughput, and evaluation latency. All three scenarios produce scores inside the operational tolerance of the 7-dimension framework.
| Scenario | Precision | Qwen3 (0.6B) | Gemma 4 (4.7B) | Llama Guard 4 (12B) | Total VRAM |
|---|---|---|---|---|---|
| A — Full precision | bf16 | ~2.5 GB | ~12.0 GB | ~27.7 GB | ~42.2 GB |
| B — INT4 quantized | embeddings bf16, models INT4 | ~2.5 GB | ~5.0 GB | ~9.7 GB | ~17.2 GB |
| C — INT8 quantized | embeddings bf16, models INT8 | ~2.5 GB | ~7.0 GB | ~15.7 GB | ~25.2 GB |
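The per-model figures above can be roughly reproduced from parameter count and precision: weights occupy roughly one byte per parameter per byte of precision, plus a runtime allowance for KV cache, activations, and buffers. The sketch below is an estimator, not the platform's sizing tool; the overhead constant is an assumption calibrated against the Llama Guard 4 rows and varies per model.

```python
# Rough VRAM estimator for the sizing table above.
# bytes_per_param: bf16 = 2.0, INT8 = 1.0, INT4 = 0.5.
# overhead_gb is an ASSUMED allowance for KV cache, activations, and
# runtime buffers -- it differs per model; calibrate against measurements.

def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     overhead_gb: float = 3.5) -> float:
    # 1e9 parameters at N bytes each is approximately N GB of weights.
    weights_gb = params_billion * bytes_per_param
    return round(weights_gb + overhead_gb, 1)

# Llama Guard 4 (12B) with a ~3.7 GB overhead allowance:
print(estimate_vram_gb(12, 2.0, overhead_gb=3.7))  # ~27.7 GB (Scenario A row)
print(estimate_vram_gb(12, 0.5, overhead_gb=3.7))  # ~9.7 GB (Scenario B row)
```

Concurrency grows the KV cache portion well beyond this single-request baseline, which is why the Production profile budgets an 80 GB card for the same models.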
| Scenario | Single-GPU Option | Alternative Option |
|---|---|---|
| A (bf16) | 1 × NVIDIA A100 80 GB | 2 × NVIDIA A6000 48 GB |
| B (INT4) | 1 × NVIDIA RTX 4090 24 GB | 1 × NVIDIA A5000 24 GB |
| C (INT8) | 1 × NVIDIA L40 48 GB | 2 × NVIDIA RTX 4080 16 GB |
All sizing above assumes a single-request baseline per model. vLLM serves multiple concurrent requests through continuous batching, which grows the KV cache pool proportionally.
If evaluations run sequentially (one model active at a time), the inactive model can be offloaded from GPU memory. The platform then only requires the largest model plus the embeddings: in Scenario B, ~9.7 GB for Llama Guard 4 plus ~2.5 GB for Qwen3-Embedding, roughly 12.2 GB.
Sequential execution reduces VRAM but increases end-to-end evaluation latency. Suitable for audit-style evaluation or low-volume Enterprise pilots. Not recommended for real-time monitoring of production AI systems.
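The sequential VRAM floor is simply the largest single model plus the always-resident embedding model. A minimal check using the Scenario B (INT4) figures from the sizing table:

```python
# Sequential execution: only one large model resident at a time,
# plus the always-loaded embedding model.
# Figures are the Scenario B (INT4) values from the sizing table (GB).

scenario_b = {
    "qwen3_embedding": 2.5,   # bf16, always resident
    "gemma_4_e4b": 5.0,       # INT4
    "llama_guard_4": 9.7,     # INT4
}

embeddings = scenario_b["qwen3_embedding"]
largest_model = max(scenario_b["gemma_4_e4b"], scenario_b["llama_guard_4"])

sequential_vram = embeddings + largest_model
print(f"Sequential VRAM floor: {sequential_vram:.1f} GB")  # 12.2 GB vs 17.2 GB parallel
```

The ~5 GB saved over the parallel Scenario B footprint is paid for in model load/offload time on every evaluation cycle.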
Infrastructure Services
The infrastructure services run on CPU and system RAM. They are independent of the GPU workload and can be co-located on the same node as the evaluation models or distributed across dedicated hosts.
| Component | RAM (min) | RAM (rec) | CPU (min) | CPU (rec) | Storage |
|---|---|---|---|---|---|
| MinIO | 2 GB | 4 GB | 2 cores | 4 cores | 50 GB NVMe SSD min, variable |
| LakeFS | 1 GB | 2 GB | 1 core | 2 cores | 10 GB for metadata |
| LakeFS PostgreSQL | 1 GB | — | — | — | Included with LakeFS |
| Apache Kafka | 2 GB | 4 GB | 2 cores | 4 cores | 20 GB for message logs |
| Zookeeper | 512 MB | 1 GB | — | — | Included with Kafka |
| API Orchestrator (FastAPI) | 256 MB | 512 MB | 1 core | 1 core | — |
| Metrics Service (FastAPI) | 512 MB | 1 GB | 2 cores | 2 cores | — |
MinIO is I/O-intensive. NVMe SSD is required for any deployment handling datasets larger than 10 GB. Direct-attached storage is preferred over network-attached storage for the MinIO volume.
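Assuming all services are co-located on a single node, the minimums in the table above sum to a modest CPU-side footprint, well inside even the Minimum profile's 32 GB of system RAM. A quick budget check (values copied from the table):

```python
# Minimum RAM budget for the CPU-side infrastructure services (GB),
# using the minimums from the table above.
services_min_ram_gb = {
    "minio": 2.0,
    "lakefs": 1.0,
    "lakefs_postgresql": 1.0,
    "kafka": 2.0,
    "zookeeper": 0.5,
    "api_orchestrator": 0.25,
    "metrics_service": 0.5,
}

total = sum(services_min_ram_gb.values())
print(f"Infrastructure services minimum RAM: {total:.2f} GB")  # 7.25 GB
# The remainder of system RAM is left for the OS, Docker,
# and vLLM host-side buffers.
```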
Storage, Network & Operating System
| Item | Allocation |
|---|---|
| Operating system and Docker runtime | 30 GB |
| Docker images (services) | 15 GB |
| Evaluation model weights (from Hugging Face) | ~35 GB |
| MinIO dataset storage | 50 GB minimum, variable with use |
| Kafka logs | 20 GB |
| LakeFS metadata | 10 GB |
| Total storage | 160 GB minimum, NVMe SSD |
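The line items above sum to the 160 GB floor; on the Minimum profile's 256 GB disk, the remainder is the growth room for MinIO datasets and Kafka logs. A quick check (the 256 GB figure is the Minimum profile from the configuration table):

```python
# Storage budget check against the Minimum profile (256 GB NVMe SSD),
# using the allocations from the table above (GB).
allocations_gb = {
    "os_and_docker_runtime": 30,
    "docker_images": 15,
    "model_weights": 35,   # ~35 GB of weights from Hugging Face
    "minio_datasets": 50,  # minimum; grows with use
    "kafka_logs": 20,
    "lakefs_metadata": 10,
}

total_min = sum(allocations_gb.values())  # 160 GB, matching the table
headroom = 256 - total_min                # growth room on the Minimum profile
print(f"Fixed allocations: {total_min} GB, growth headroom: {headroom} GB")
```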
Hybrid Deployment
For pilot deployments or environments without local GPU capacity, Enterprise supports a hybrid configuration. The embedding model runs locally on a small GPU. The two larger evaluation models (Gemma 4 and Llama Guard 4) are served through a regulated inference provider over private endpoints.
This configuration preserves the 7-dimension framework and the immutable audit trail. The tradeoff is an operational dependency on the inference provider and per-request inference costs instead of a one-time hardware investment.
| Resource | Requirement |
|---|---|
| GPU | Any GPU with 4+ GB VRAM (e.g., NVIDIA T4, RTX 3060) |
| System RAM | 16 GB |
| CPU | 8 cores |
| Storage | 160 GB NVMe SSD |
| Inference provider | Regulated provider with private endpoints and EU-region routing |
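In the hybrid mode, each evaluation model resolves to a different serving location: embeddings to the local GPU, the two larger models to the provider's private endpoints. A minimal routing sketch — the endpoint URLs and model identifiers below are hypothetical placeholders, not the actual platform configuration:

```python
# Hybrid routing sketch: embeddings stay local, the judge and guardian
# models go to a regulated inference provider over private endpoints.
# Both URLs are HYPOTHETICAL placeholders for illustration only.

LOCAL_ENDPOINT = "http://localhost:8000/v1"            # local GPU (embeddings)
PROVIDER_ENDPOINT = "https://inference.example.eu/v1"  # EU-region private endpoint

ROUTES = {
    "qwen3-embedding": LOCAL_ENDPOINT,  # Toxicity + Regulatory Compliance
    "gemma-4-e4b": PROVIDER_ENDPOINT,   # Context (LLM-as-Judge)
    "llama-guard-4": PROVIDER_ENDPOINT, # Bias (guardian)
}

def endpoint_for(model: str) -> str:
    """Resolve the serving endpoint for an evaluation model."""
    try:
        return ROUTES[model]
    except KeyError:
        raise ValueError(f"Unknown evaluation model: {model}")

print(endpoint_for("qwen3-embedding"))  # local endpoint
```

The routing table is the whole architectural difference from the on-premise mode: the audit trail and scoring pipeline are unchanged, only the inference transport moves.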
Next Steps
This page is a reference, not a procurement checklist. Every deployment is calibrated to the organization's regulatory posture, AI system inventory, and existing infrastructure. The Architecture Assessment — part of the standard Enterprise implementation — turns this reference into a concrete deployment plan.
A focused technical session with our platform team. We review your environment, your residency constraints, and your expected evaluation volume. You leave with a sized deployment plan.
If you are earlier in evaluation, our team can walk through the Enterprise product, the 7-dimension framework, and deployment options in a 30-minute conversation.
For the full product overview, differentiators, and implementation timeline, return to the Enterprise product page.
Last updated: April 2026