Deploying Collibra Unstructured AI infrastructure
Deploying Unstructured AI infrastructure involves setting up the necessary resources in your cloud environment to support advanced AI applications. This page provides step-by-step instructions for deploying the infrastructure, including prerequisites, configuration details, deployment steps, and optional resources. It also covers troubleshooting common issues and upgrading existing deployments to newer versions.
Tip: Deploy Unstructured AI infrastructure in a dedicated AWS subaccount for better resource isolation and management. If you choose to set up a subaccount, follow the AWS Organizations documentation to create a new account under your organization.
Prerequisites
- You must have the following tools installed on your local machine:
- You must have an AWS account with the necessary permissions to create and manage resources.
- An S3 bucket to store the Terraform state file. You can create a new bucket or use an existing one. Update the following in the providers.tf file:
bucket = "your-terraform-state-bucket-name"
key = "path/to/your/terraform.tfstate"
region = "your-aws-region"
- A Route53 hosted zone for managing DNS records. If you plan to use a custom domain name for accessing the Unstructured AI application, ensure that you have a hosted zone set up in Route53. The zone ID will be required in the terraform.tfvars file.
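The prerequisites above can be checked from a terminal before you start. The following is a minimal sketch, not part of the Collibra tarball: the function names are hypothetical, and the tool names in the usage comments (terraform, aws, kubectl) are assumed only because those commands appear on this page. The AWS CLI calls need valid credentials for the target account.

```shell
#!/bin/sh
# Pre-flight helpers (hypothetical names, not shipped with the product).

# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed if the Terraform state bucket exists and your credentials can reach it.
check_state_bucket() {
  aws s3api head-bucket --bucket "$1"
}

# Print the DNS name of the Route53 hosted zone with the given zone ID.
check_hosted_zone() {
  aws route53 get-hosted-zone --id "$1" --query 'HostedZone.Name' --output text
}

# Example usage (placeholders must be replaced with your own values):
#   missing_tools terraform aws kubectl
#   check_state_bucket your-terraform-state-bucket-name
#   check_hosted_zone <your-hosted-zone-id>
```

An empty result from missing_tools means every listed tool was found. The two AWS checks are defined but not called automatically, so nothing contacts AWS until you run them yourself.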
Bring your own VPN
If you prefer to use your own VPN solution instead of the one provided in this deployment:
- Set up your VPN solution according to the provider's documentation.
- Ensure that the VPN allows access to the necessary AWS resources. See the /iac/aws/modules/vpn directory for details on what is required.
Deployment steps
- Download and extract the Terraform tarball (.tgz) from the Collibra downloads page.
- Authenticate with the AWS account where you want to deploy the infrastructure. Use the AWS CLI to configure your credentials:
aws configure
- Update the terraform.tfvars file with your desired configuration parameters. Refer to the comments in the file for guidance on each parameter.
- Run the Terraform commands to deploy the infrastructure:
terraform init
terraform apply -target=module.networking -target=module.eks -target=module.rds -target=module.vpn -target=module.iam
- Once the EKS cluster is up, provide the worker node group role ARN to the Unstructured AI team so it can be added to the ECR registry policy to allow the cluster to pull images from the private ECR repository. After this, continue with the following Terraform command to apply everything else and ensure your infrastructure is fully in sync with your configuration:
terraform apply
- If you did not use a custom VPN, download and connect to the VPN:
- Download the AWS Client VPN application from https://aws.amazon.com/vpn/client-vpn-download/
- Open the VPN client and import the downloaded config file: File → Add Profile → select the .ovpn file that was generated in the /modules/vpn module.
- Connect to the VPN using the created profile.
- Verify the deployment was successful:
- Access the Unstructured AI application through the provided domain name to ensure you can reach the home page.
- Use kubectl to ensure all pods in the EKS cluster are running as expected:
aws eks --region <your-region> update-kubeconfig --name <your-cluster-name>
kubectl get pods --all-namespaces
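Two of the steps above lend themselves to small helpers: looking up the worker node group role ARN and spotting pods that are not healthy. The sketch below is illustrative only. The function names are hypothetical, and the role lookup assumes the cluster uses EKS managed node groups; if the deployment creates self-managed nodes, the role has to be found another way.

```shell
#!/bin/sh
# Read `kubectl get pods --all-namespaces --no-headers` output on stdin and
# print "namespace/pod STATUS" for every pod that is neither Running nor Completed.
# Column 4 of that output is STATUS.
pods_needing_attention() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Print the IAM role ARN of each managed node group in the cluster.
# Arguments: cluster name, AWS region.
node_group_role_arns() {
  cluster="$1"
  region="$2"
  for group in $(aws eks list-nodegroups --cluster-name "$cluster" \
      --region "$region" --query 'nodegroups[]' --output text); do
    aws eks describe-nodegroup --cluster-name "$cluster" \
      --nodegroup-name "$group" --region "$region" \
      --query 'nodegroup.nodeRole' --output text
  done
}

# Example usage:
#   node_group_role_arns <your-cluster-name> <your-region>
#   kubectl get pods --all-namespaces --no-headers | pods_needing_attention
```

No output from pods_needing_attention means every pod reports Running or Completed. A pod can report Running while its containers are not yet ready, so the READY column is still worth a look.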
Optional EKS resources
The following resources are optional and you can include them in your deployment based on your requirements:
- OpenTelemetry: For monitoring the health and performance of your infrastructure.
- Cluster Autoscaler: Automatically adjusts the number of nodes in your EKS cluster based on workload demands.
- linkerd: A service mesh to manage microservices communication in your EKS cluster.
- cert-manager: Manages TLS certificates in your Kubernetes cluster.
- linkerd-certs: Creates the necessary certificates for linkerd and includes a cron job to rotate them monthly.
Upgrading Unstructured AI infrastructure
To upgrade your Unstructured AI infrastructure to a newer version:
- Download and extract the latest Terraform tarball (.tgz) from the Collibra downloads page.
- Review the release notes for any breaking changes or important information regarding the upgrade.
- Update your existing terraform.tfvars file with any new parameters or changes introduced in the new version.
- Run the following Terraform commands to apply the upgrade:
terraform init
terraform apply
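If you want to review what an upgrade will change before applying it, one option is to save a plan first. This is a sketch of that optional pattern rather than a documented Collibra step. The function names and the upgrade.tfplan file name are hypothetical; the -out and -detailed-exitcode flags are standard Terraform CLI options.

```shell
#!/bin/sh
# Translate the exit code of `terraform plan -detailed-exitcode` into a message.
# Terraform documents these codes as: 0 = no changes, 1 = error, 2 = changes present.
describe_plan_exit() {
  case "$1" in
    0) echo "no changes: infrastructure already matches the configuration" ;;
    1) echo "plan failed: fix the error before applying" ;;
    2) echo "changes pending: review upgrade.tfplan, then apply it" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Initialize, then save a plan for review instead of applying immediately.
review_upgrade() {
  terraform init || { describe_plan_exit 1; return 1; }
  terraform plan -detailed-exitcode -out=upgrade.tfplan
  status=$?
  describe_plan_exit "$status"
  return "$status"
}

# Example usage:
#   review_upgrade
#   terraform apply upgrade.tfplan
```

Applying the saved plan file runs exactly the changes that were reviewed, which can help when the release notes mention breaking changes.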
Update Helm installation with new parameters
To update the Helm installation with new parameters:
- Update your existing terraform.tfvars file with any new parameters or changes introduced in the new version.
- Run the following command to apply the changes:
terraform apply
Troubleshooting Helm provider OCI registry authentication errors
There is a known issue caused by a bug in Helm provider version 3.x where the repository_password is cached in Terraform state. When ECR authorization tokens expire after 12 hours, Terraform continues to use the expired token from state instead of fetching a fresh token from the data.aws_ecr_authorization_token data source.
The error message is:
Error: OCI Registry Login Failed
Failed to log in to OCI registry "oci://...": response status code 403: denied: Your authorization token has expired. Reauthenticate and try again.
Steps to resolve
The workaround involves removing the affected Helm releases from Terraform state and re-importing them. This clears the cached expired credentials and allows Terraform to use fresh authentication tokens.
Remove and re-import all Helm releases that pull from ECR:
terraform state rm module.frontend.helm_release.frontend
terraform import module.frontend.helm_release.frontend unstructured/unstructured-frontend
terraform state rm module.backend.helm_release.backend
terraform import module.backend.helm_release.backend unstructured/unstructured-backend
terraform state rm module.linkerd_certs.helm_release.linkerd_certs
terraform import module.linkerd_certs.helm_release.linkerd_certs linkerd/linkerd-certs
terraform state rm module.linkerd_certs.helm_release.cert_manager
terraform import module.linkerd_certs.helm_release.cert_manager cert-manager/cert-manager
terraform state rm module.linkerd_crds.helm_release.linkerd_crds
terraform import module.linkerd_crds.helm_release.linkerd_crds linkerd/linkerd-crds
terraform state rm module.linkerd.helm_release.linkerd
terraform import module.linkerd.helm_release.linkerd linkerd/linkerd
terraform state rm module.ingress.helm_release.aws_load_balancer_controller
terraform import module.ingress.helm_release.aws_load_balancer_controller kube-system/aws-load-balancer-controller
terraform state rm module.eks_workload_addons.helm_release.argo_events
terraform import module.eks_workload_addons.helm_release.argo_events unstructured/argo-events
terraform state rm module.eks_workload_addons.helm_release.argo_workflows
terraform import module.eks_workload_addons.helm_release.argo_workflows unstructured/argo-workflows
terraform state rm module.eks_workload_addons.helm_release.external_secrets_operator
terraform import module.eks_workload_addons.helm_release.external_secrets_operator unstructured/external-secrets
terraform state rm module.eks_workload_addons.helm_release.otel_collector
terraform import module.eks_workload_addons.helm_release.otel_collector unstructured/otel-collector
terraform state rm module.eks_workload_addons.helm_release.aws_ebs_csi_driver
terraform import module.eks_workload_addons.helm_release.aws_ebs_csi_driver kube-system/aws-ebs-csi-driver
terraform state rm module.eks_workload_addons.helm_release.cluster-autoscaler
terraform import module.eks_workload_addons.helm_release.cluster-autoscaler kube-system/cluster-autoscaler
terraform apply
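The remove and re-import pairs above follow one pattern, so they can be generated from a table. The sketch below is a convenience, not part of the product: the function names and the reimport.sh file name are hypothetical, while the resource addresses and release IDs are copied from the commands listed above. It only prints commands; nothing touches Terraform state until you run the generated file.

```shell
#!/bin/sh
# Table of "<terraform resource address> <namespace/release-name>" pairs,
# copied from the commands in this section.
helm_releases() {
  cat <<'EOF'
module.frontend.helm_release.frontend unstructured/unstructured-frontend
module.backend.helm_release.backend unstructured/unstructured-backend
module.linkerd_certs.helm_release.linkerd_certs linkerd/linkerd-certs
module.linkerd_certs.helm_release.cert_manager cert-manager/cert-manager
module.linkerd_crds.helm_release.linkerd_crds linkerd/linkerd-crds
module.linkerd.helm_release.linkerd linkerd/linkerd
module.ingress.helm_release.aws_load_balancer_controller kube-system/aws-load-balancer-controller
module.eks_workload_addons.helm_release.argo_events unstructured/argo-events
module.eks_workload_addons.helm_release.argo_workflows unstructured/argo-workflows
module.eks_workload_addons.helm_release.external_secrets_operator unstructured/external-secrets
module.eks_workload_addons.helm_release.otel_collector unstructured/otel-collector
module.eks_workload_addons.helm_release.aws_ebs_csi_driver kube-system/aws-ebs-csi-driver
module.eks_workload_addons.helm_release.cluster-autoscaler kube-system/cluster-autoscaler
EOF
}

# Print (do not run) the state rm / import pair for each release.
helm_reimport_commands() {
  helm_releases | while read -r address release_id; do
    printf 'terraform state rm %s\n' "$address"
    printf 'terraform import %s %s\n' "$address" "$release_id"
  done
}

# Example usage:
#   helm_reimport_commands > reimport.sh   # review the file first
#   sh reimport.sh
#   terraform apply
```

If you did not deploy some of the optional resources, remove their rows from the table before generating the file, since importing a release that does not exist will fail.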
When this occurs
- After ECR authorization tokens expire (tokens are valid for 12 hours).
- After extended periods between Terraform applies.
- When switching AWS profiles or credentials.
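The 12-hour rule can be turned into a quick check before you run Terraform. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and it assumes you record the time of your last successful apply yourself (for example, by saving the output of date +%s), because Terraform does not expose that timestamp directly.

```shell
#!/bin/sh
# ECR authorization tokens are valid for 12 hours (from the section above).
ECR_TOKEN_LIFETIME_SECONDS=$((12 * 60 * 60))

# Succeed (exit 0) when at least 12 hours separate the two epoch timestamps.
# Arguments: time the token was fetched, and optionally the current time
# (defaults to now). Both are seconds since the Unix epoch.
ecr_token_expired() {
  issued_at="$1"
  now="${2:-$(date +%s)}"
  [ $((now - issued_at)) -ge "$ECR_TOKEN_LIFETIME_SECONDS" ]
}

# Example usage, with a hypothetical timestamp saved after the last apply:
#   if ecr_token_expired "$(cat .last-apply-epoch)"; then
#     echo "Cached ECR token is likely expired; expect the 403 error above."
#   fi
```

This covers only the time-based cases. Switching AWS profiles or credentials can trigger the same error regardless of how recently you applied.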