EthiCompass

Enterprise · Technical Reference

Hardware & Software
Requirements.

Deployment reference for the EthiCompass Enterprise platform. On-premise, private cloud, and hybrid configurations for regulated enterprises. Structured sizing for every scenario, from minimum viable deployment to high-concurrency production.

CISOs · Heads of Platform · Infrastructure Leads · Procurement

This page is a technical reference. For commercial discussion, implementation timelines, and engagement scope, contact our team.


Deployment Modes

Three paths. One platform.
Data sovereignty preserved.

Enterprise supports three deployment modes. Each preserves the immutable audit trail, the 7-dimension evaluation framework, and the EU AI Act Articles 9–15 mapping. The difference is where the inference workload runs and what infrastructure footprint you accept.

On-Premise

Deployed inside your data center. All inference runs locally on dedicated GPUs. No outbound calls. Suitable for the strictest data residency and regulatory requirements, including sectors with absolute on-premise mandates.

Private Cloud

Deployed in your AWS, Azure, or GCP tenancy. Inference runs on managed GPU instances within your network boundary. Data remains in your cloud account. Suitable for enterprises with approved cloud regions and residency controls.

Hybrid

Local Embeddings + Cloud Inference

Embeddings run locally on a small GPU. The two larger models are served through a regulated inference provider over private endpoints. Lower hardware footprint, faster time to deployment. Suitable for pilot deployments or environments with limited local GPU capacity.

Deployment mode is a decision made during the Architecture Assessment phase of implementation. All modes produce the same compliance evidence and the same immutable audit trail.

Reference Architecture

What Enterprise Runs

The Enterprise platform is composed of infrastructure services and evaluation models. The infrastructure services manage data storage, versioning, messaging, and the application layer. The evaluation models produce the quantitative scores behind the 7-dimension framework.

Infrastructure Services

| Component | Role |
| --- | --- |
| MinIO | S3-compatible object storage for datasets and evidence artifacts |
| LakeFS | Versioned dataset management with full lineage |
| Apache Kafka + Zookeeper | Event streaming for evaluation pipelines and audit events |
| API Orchestrator (FastAPI) | Coordinates evaluation requests across dimensions |
| Metrics Service (FastAPI) | Aggregates scores and produces the compliance scorecards |

Evaluation Models

| Model | Role in the 7-Dimension Framework | Served Via |
| --- | --- | --- |
| Qwen3-Embedding (600M parameters) | Embeddings for the Toxicity and Regulatory Compliance dimensions | vLLM (embedding task) |
| Gemma 4 E4B (4.7B parameters) | Judge model for the Context dimension (LLM-as-Judge) | vLLM |
| Llama Guard 4 (12B parameters) | Guardian model for the Bias dimension | vLLM |

All models are served through vLLM with PagedAttention, continuous batching, and efficient KV-cache management. The embedding model is shared across two dimensions and runs as a single instance.

Configuration Profiles

Minimum · Recommended · Production

Three reference configurations. Each is validated for a specific deployment posture. Select the profile that matches your evaluation concurrency, residency requirements, and regulatory commitments.

| Resource | Minimum (INT4, sequential) | Recommended (INT4, parallel) | Production (bf16, high concurrency) |
| --- | --- | --- | --- |
| GPU | 1 × 16–24 GB VRAM (e.g., RTX 3090, RTX 4080, RTX 4090) | 1 × 24 GB VRAM (e.g., RTX 4090, A5000) | 1 × 80 GB VRAM (A100) or 2 × 48 GB (A6000) |
| System RAM | 32 GB DDR4/DDR5 | 64 GB DDR5 | 128 GB DDR5 ECC |
| CPU | 14 cores | 20 cores | 32 cores |
| Storage | 256 GB NVMe SSD | 512 GB NVMe SSD | 1 TB NVMe SSD |
| Network | 1 Gbps | 1 Gbps | 10 Gbps |
| Use case | Proof of concept, low-volume evaluation | Standard enterprise deployment | Multi-tenant, high-concurrency, real-time monitoring |

All profiles assume a single-node deployment. Multi-node clustering is available for production environments with failover requirements. Discuss during Architecture Assessment.

GPU & Inference Sizing

Three inference scenarios.
Three VRAM profiles.

The three evaluation models have materially different memory profiles. The deployment can run in full precision, INT8, or INT4. Each scenario is a tradeoff between VRAM footprint, throughput, and evaluation latency. All three scenarios produce scores inside the operational tolerance of the 7-dimension framework.

VRAM by Scenario

| Scenario | Precision | Qwen3 (0.6B) | Gemma 4 (4.7B) | Llama Guard 4 (12B) | Total VRAM |
| --- | --- | --- | --- | --- | --- |
| A — Full precision | bf16 | ~2.5 GB | ~12.0 GB | ~27.7 GB | ~42.2 GB |
| B — INT4 quantized | embeddings bf16, models INT4 | ~2.5 GB | ~5.0 GB | ~9.7 GB | ~17.2 GB |
| C — INT8 quantized | embeddings bf16, models INT8 | ~2.5 GB | ~7.0 GB | ~15.7 GB | ~25.2 GB |
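As a sanity check during capacity planning, the scenario totals above can be reproduced with a short sketch. The per-model figures are copied from the table (GB, including runtime overhead); the dictionary keys and helper name are illustrative, not platform APIs:

```python
# Per-model VRAM figures from the table above, in GB (illustrative keys).
VRAM_GB = {
    "A_bf16": {"qwen3_0.6b": 2.5, "gemma4_4.7b": 12.0, "llama_guard4_12b": 27.7},
    "B_int4": {"qwen3_0.6b": 2.5, "gemma4_4.7b": 5.0,  "llama_guard4_12b": 9.7},
    "C_int8": {"qwen3_0.6b": 2.5, "gemma4_4.7b": 7.0,  "llama_guard4_12b": 15.7},
}

def total_vram(scenario: str) -> float:
    """Total VRAM (GB) when all three models are resident at once."""
    return round(sum(VRAM_GB[scenario].values()), 1)

for name in VRAM_GB:
    print(f"{name}: {total_vram(name)} GB")
```

The totals come out at 42.2, 17.2, and 25.2 GB respectively, matching the table and indicating which GPU class each scenario needs.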

Recommended GPUs per Scenario

| Scenario | Single-GPU Option | Multi-GPU Option |
| --- | --- | --- |
| A (bf16) | 1 × NVIDIA A100 80 GB | 2 × NVIDIA A6000 48 GB |
| B (INT4) | 1 × NVIDIA RTX 4090 24 GB | 1 × NVIDIA A5000 24 GB |
| C (INT8) | 1 × NVIDIA L40 48 GB | 2 × NVIDIA RTX 4080 16 GB |

Concurrency Planning

All sizing above assumes a single-request baseline per model. vLLM handles multiple concurrent requests through continuous batching, which grows the KV-cache pool proportionally.

  • Each active request adds ~50–100 MB of KV cache on Gemma 4
  • Each active request adds ~150–250 MB of KV cache on Llama Guard 4
  • Budget an additional ~2–3 GB of VRAM per 8 concurrent requests
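A rough headroom estimate follows from the upper-bound figures above. This is a planning sketch, not a platform utility; the helper name and per-request constants are illustrative:

```python
# Upper-bound per-request KV-cache growth from the bullets above, in GB.
KV_PER_REQUEST_GB = {
    "gemma4": 0.10,        # upper end of ~50–100 MB
    "llama_guard4": 0.25,  # upper end of ~150–250 MB
}

def kv_cache_headroom_gb(concurrent_requests: int) -> float:
    """Worst-case extra VRAM if every request exercises both judge models."""
    per_request = sum(KV_PER_REQUEST_GB.values())
    return round(concurrent_requests * per_request, 2)

print(kv_cache_headroom_gb(8))
```

For 8 concurrent requests this yields 2.8 GB, inside the ~2–3 GB budget quoted above.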

Sequential Execution Option

If evaluations run sequentially (one model active at a time), the inactive model can be offloaded from GPU memory. The platform then only requires the largest model plus the embeddings:

  • bf16: ~30.2 GB (Llama Guard 4 + embeddings)
  • INT8: ~18.2 GB (Llama Guard 4 INT8 + embeddings)
  • INT4: ~12.2 GB (Llama Guard 4 INT4 + embeddings)
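The sequential figures above are simply the largest model plus the always-resident embedding model. A hypothetical sketch, using the per-model figures from the VRAM table:

```python
# Figures (GB) from the VRAM-by-scenario table; names are illustrative.
EMBEDDINGS_GB = 2.5  # Qwen3-Embedding, bf16, always resident

LLAMA_GUARD4_GB = {"bf16": 27.7, "int8": 15.7, "int4": 9.7}

def sequential_peak_vram(precision: str) -> float:
    """Peak VRAM (GB) when only one large model is loaded at a time."""
    return round(LLAMA_GUARD4_GB[precision] + EMBEDDINGS_GB, 1)
```

Gemma 4 never sets the peak here because its footprint is below Llama Guard 4's at every precision.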

Sequential execution reduces VRAM but increases end-to-end evaluation latency. Suitable for audit-style evaluation or low-volume Enterprise pilots. Not recommended for real-time monitoring of production AI systems.

Infrastructure Services

RAM, CPU, and storage for the
non-GPU components.

The infrastructure services run on CPU and system RAM. They are independent of the GPU workload and can be co-located on the same node as the evaluation models or distributed across dedicated hosts.

| Component | RAM (min) | RAM (rec) | CPU (min) | CPU (rec) | Storage |
| --- | --- | --- | --- | --- | --- |
| MinIO | 2 GB | 4 GB | 2 cores | 4 cores | 50 GB NVMe SSD min, variable |
| LakeFS | 1 GB | 2 GB | 1 core | 2 cores | 10 GB for metadata |
| LakeFS PostgreSQL | 1 GB | — | — | — | Included with LakeFS |
| Apache Kafka | 2 GB | 4 GB | 2 cores | 4 cores | 20 GB for message logs |
| Zookeeper | 512 MB | 1 GB | — | — | Included with Kafka |
| API Orchestrator (FastAPI) | 256 MB | 512 MB | 1 core | 1 core | — |
| Metrics Service (FastAPI) | 512 MB | 1 GB | 2 cores | 2 cores | — |
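For co-location planning, the per-service minimums can be totaled to see what the CPU-side stack draws from a node's system RAM before the GPU workload is counted. The figures come from the table above; the dictionary keys are illustrative:

```python
# Minimum RAM per infrastructure service, in GB (from the table above).
SERVICE_RAM_MIN_GB = {
    "minio": 2.0,
    "lakefs": 1.0,
    "lakefs_postgresql": 1.0,
    "kafka": 2.0,
    "zookeeper": 0.5,
    "api_orchestrator": 0.25,
    "metrics_service": 0.5,
}

total = sum(SERVICE_RAM_MIN_GB.values())
print(f"Services (min): {total:.2f} GB of system RAM")
```

At roughly 7.25 GB, the service stack fits comfortably inside the 32 GB Minimum profile, leaving the remainder for the OS, Docker, and inference-server host memory.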

MinIO is I/O-intensive. NVMe SSD is required for any deployment handling datasets larger than 10 GB. Direct-attached storage is preferred over network-attached storage for the MinIO volume.

Storage, Network & Operating System

Baseline footprint for a
single-node deployment.

Storage — Total Footprint

| Item | Allocation |
| --- | --- |
| Operating system and Docker runtime | 30 GB |
| Docker images (services) | 15 GB |
| Evaluation model weights (from Hugging Face) | ~35 GB |
| MinIO dataset storage | 50 GB minimum, variable with use |
| Kafka logs | 20 GB |
| LakeFS metadata | 10 GB |
| Total storage | 160 GB minimum, NVMe SSD |
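The baseline total is simply the sum of the fixed allocations. A small tally, with keys chosen here for illustration:

```python
# Storage allocations from the table above, in GB (illustrative keys).
STORAGE_GB = {
    "os_and_docker_runtime": 30,
    "docker_images": 15,
    "model_weights": 35,   # ~35 GB of evaluation model weights
    "minio_datasets_min": 50,
    "kafka_logs": 20,
    "lakefs_metadata": 10,
}

total_gb = sum(STORAGE_GB.values())
print(f"Baseline footprint: {total_gb} GB")
```

Only the MinIO allocation grows with use; the rest is effectively fixed, so dataset volume drives any sizing above 160 GB.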

Network

  • 1 Gbps for the Minimum and Recommended profiles
  • 10 Gbps recommended for the Production profile
  • HTTPS egress for initial model weight download (or internal mirror if egress is restricted)

Operating System & Runtime

  • Linux x86_64 (Ubuntu 22.04 LTS or equivalent RHEL-family)
  • Docker 24.x or containerd 1.7+
  • NVIDIA Container Toolkit for GPU passthrough
  • Kubernetes 1.28+ supported for orchestrated deployments

Security Controls

  • Air-gapped environments with pre-mirrored model registry
  • SSO/LDAP integration for platform access
  • SIEM-compatible audit log streaming
  • Full-disk encryption required for immutable audit trail retention

Hybrid Deployment

When local GPU capacity
is constrained.

For pilot deployments or environments without local GPU capacity, Enterprise supports a hybrid configuration. The embedding model runs locally on a small GPU. The two larger evaluation models (Gemma 4 and Llama Guard 4) are served through a regulated inference provider over private endpoints.

This configuration preserves the 7-dimension framework and the immutable audit trail. The tradeoff is an operational dependency on the inference provider and per-request inference costs instead of a one-time hardware investment.

Hybrid Configuration Requirements

| Resource | Requirement |
| --- | --- |
| GPU | Any GPU with 4+ GB VRAM (e.g., NVIDIA T4, RTX 3060) |
| System RAM | 16 GB |
| CPU | 8 cores |
| Storage | 160 GB NVMe SSD |
| Inference provider | Regulated provider with private endpoints and EU-region routing |

When to Choose Hybrid

  • Pilot deployments where local GPU procurement is not yet approved
  • Environments where data residency permits regulated third-party inference
  • Organizations optimizing for rapid deployment over long-term unit economics

When to Choose On-Premise or Private Cloud

  • Strict data residency mandates that prohibit third-party inference
  • Sectors where all AI inference must run inside the regulated environment
  • High-volume production workloads where per-request costs exceed amortized hardware cost
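The last point can be made concrete with a break-even sketch. Every cost figure below is an illustrative assumption, not vendor pricing; substitute your own quotes before drawing conclusions:

```python
# Illustrative break-even for hybrid (per-request cost) vs. owned hardware.
HARDWARE_COST_EUR = 30_000        # assumed GPU server purchase price
AMORTIZATION_MONTHS = 36          # assumed depreciation window
COST_PER_EVALUATION_EUR = 0.05    # assumed hybrid per-evaluation inference cost

def breakeven_evals_per_month() -> float:
    """Monthly evaluation volume above which owned hardware is cheaper."""
    monthly_hardware = HARDWARE_COST_EUR / AMORTIZATION_MONTHS
    return monthly_hardware / COST_PER_EVALUATION_EUR

print(f"Break-even: ~{breakeven_evals_per_month():,.0f} evaluations/month")
```

Under these assumed numbers the crossover sits around 17,000 evaluations per month; below that, hybrid's pay-per-request model tends to win on cost.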

Next Steps

From reference to deployment.

This page is a reference, not a procurement checklist. Every deployment is calibrated to the organization's regulatory posture, AI system inventory, and existing infrastructure. The Architecture Assessment — part of the standard Enterprise implementation — turns this reference into a concrete deployment plan.

Request an Architecture Review

A focused technical session with our platform team. We review your environment, your residency constraints, and your expected evaluation volume. You leave with a sized deployment plan.

Talk to Our Team

If you are earlier in evaluation, our team can walk through the Enterprise product, the 7-dimension framework, and deployment options in a 30-minute conversation.

Return to Enterprise

For the full product overview, differentiators, and implementation timeline, return to the Enterprise product page.

Back to Enterprise

Last updated: April 2026