Deploying Collibra Unstructured AI infrastructure

Deploying Unstructured AI infrastructure involves setting up the necessary resources in your cloud environment to support advanced AI applications. This page provides step-by-step instructions for deploying the infrastructure, including prerequisites, configuration details, deployment steps, and optional resources. It also covers troubleshooting common issues and upgrading existing deployments to newer versions.

Tip: Deploy Unstructured AI infrastructure in a dedicated AWS subaccount for better resource isolation and management. If you choose to set up a subaccount, follow the AWS Organizations documentation to create a new account under your organization.

Prerequisites

  • You must have the following tools installed on your local machine:
    • Terraform (version 1.12.2 or newer)
    • AWS CLI (version 2.23.6 or newer). Optional, but the deployment steps on this page use it to configure credentials and kubeconfig.
  • You must have an AWS account with the necessary permissions to create and manage resources.

    The following IAM policy is an example of the permissions required:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CoreInfrastructureAndNetworking",
                "Effect": "Allow",
                "Action": [
                    "ec2:*",
                    "elasticloadbalancing:*",
                    "eks:*",
                    "rds:*",
                    "secretsmanager:*"
                ],
                "Resource": "*",
                "Condition": {
                  "StringEquals": {
                    "aws:ResourceTag/Project": "unstructured"
                  }
                }
            },
            {
                "Sid": "AllowCreationWithProjectTag",
                "Effect": "Allow",
                "Action": [
                    "ec2:*",
                    "elasticloadbalancing:*",
                    "eks:*",
                    "rds:*",
                    "secretsmanager:*"
                ],
                "Resource": "*",
                "Condition": {
                  "StringEquals": {
                    "aws:RequestTag/Project": "unstructured"
                  }
                }
            },
            {
                "Sid": "GlobalDiscoveryActions",
                "Effect": "Allow",
                "Action": [
                    "ec2:Describe*",
                    "eks:Describe*",
                    "eks:List*",
                    "iam:Get*",
                    "iam:List*",
                    "kms:List*",
                    "kms:Describe*",
                    "rds:Describe*",
                    "secretsmanager:ListSecrets",
                    "ssm:GetParameter"
                ],
                "Resource": "*"
            },
            {
                "Sid": "IAM",
                "Effect": "Allow",
                "Action": [
                    "sts:AssumeRole",
                    "iam:CreateOpenIDConnectProvider",
                    "iam:TagOpenIDConnectProvider"
                ],
                "Resource": "*"
            },
            {
                "Sid": "IAMLimited",
                "Effect": "Allow",
                "Action": [
                    "iam:*"
                ],
                "Resource": [
                    "arn:aws:iam::*:role/unstructured-*",
                    "arn:aws:iam::*:policy/unstructured-*",
                    "arn:aws:iam::*:instance-profile/unstructured-*"
                ]
            },
            {
                "Sid": "Misc",
                "Effect": "Allow",
                "Action": [
                    "ec2:*LaunchTemplate*",
                    "ec2:RunInstances",
                    "iam:PassRole",
                    "autoscaling:*",
                    "acm:*",
                    "cognito-idp:DescribeUserPoolClient",
                    "wafv2:*",
                    "waf-regional:*",
                    "shield:*",
                    "route53:*",
                    "ecr:GetAuthorizationToken",
                    "elasticloadbalancing:*" 
                ],
                "Resource": "*"
            },
            {
                "Sid": "S3",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::unstructured-tf-state",
                    "arn:aws:s3:::unstructured-tf-state/*"
                ]
            },
            {
                "Sid": "KMSKeyManagement",
                "Effect": "Allow",
                "Action": [
                    "kms:CreateKey",
                    "kms:DescribeKey",
                    "kms:GetKeyPolicy",
                    "kms:PutKeyPolicy",
                    "kms:TagResource",
                    "kms:ScheduleKeyDeletion",
                    "kms:ListResourceTags",
                    "kms:CreateAlias",
                    "kms:DeleteAlias",
                    "kms:ListAliases",
                    "kms:ListKeys"
                ],
                "Resource": "*"
            },
            {
                "Sid": "CloudWatchLogsManagement",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:DescribeLogGroups",
                    "logs:ListTagsForResource",
                    "logs:TagResource",
                    "logs:PutRetentionPolicy",
                    "logs:DeleteLogGroup"
                ],
                "Resource": "*"
            }
        ]
    }
  • An S3 bucket to store the Terraform state file. You can create a new bucket or use an existing one. Set the following values in the providers.tf file so that Terraform uses this bucket:
    • bucket = "your-terraform-state-bucket-name"
    • key = "path/to/your/terraform.tfstate"
    • region = "your-aws-region"
  • A Route 53 hosted zone for managing DNS records. If you plan to use a custom domain name for accessing the Unstructured AI application, ensure that you have a hosted zone set up in Route 53. You must provide the zone ID in the terraform.tfvars file.
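
The tool versions listed in the prerequisites can be checked from a terminal before you start. The following is a minimal sketch, not part of the Collibra tarball: the function names (version_ge, check_tool, check_prereqs) are hypothetical, and it assumes a sort that supports -V (GNU coreutils) and that the first line of `terraform version` has the usual "Terraform vX.Y.Z" form.

```shell
#!/usr/bin/env bash
# Sketch: compare installed tool versions against the minimums in the prerequisites.

# version_ge HAVE WANT: succeeds when HAVE >= WANT (dotted version strings).
# Uses `sort -V` (version sort); the smaller of the two versions sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME HAVE WANT: prints a verdict, returns non-zero when HAVE is too old.
check_tool() {
  if version_ge "$2" "$3"; then
    echo "$1 $2: OK (>= $3)"
  else
    echo "$1 $2: too old, need >= $3"
    return 1
  fi
}

# check_prereqs: runs the checks for Terraform (required) and AWS CLI (optional).
check_prereqs() {
  status=0
  if command -v terraform >/dev/null 2>&1; then
    # First line of `terraform version` looks like "Terraform v1.12.2".
    tf_version="$(terraform version | head -n 1 | sed 's/^Terraform v//')"
    check_tool terraform "$tf_version" 1.12.2 || status=1
  else
    echo "terraform: not found"
    status=1
  fi
  if command -v aws >/dev/null 2>&1; then
    # `aws --version` prints a string that starts with "aws-cli/<version> ".
    aws_version="$(aws --version 2>&1 | cut -d/ -f2 | cut -d' ' -f1)"
    check_tool aws "$aws_version" 2.23.6 || status=1
  else
    echo "aws: not found (optional)"
  fi
  return "$status"
}
```

Source the file and run check_prereqs; a non-zero exit status means Terraform is missing or one of the tools is older than the documented minimum.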

Bring your own VPN

If you prefer to use your own VPN solution instead of the one provided in this deployment:

  1. Set up your VPN solution according to the provider's documentation.
  2. Ensure that the VPN allows access to the necessary AWS resources. See the /iac/aws/modules/vpn directory for details on what is required.

Deployment steps

  1. Download and extract the Terraform tarball (.tgz) from the Collibra downloads page.
  2. Authenticate with the AWS account where you want to deploy the infrastructure. Use the AWS CLI to configure your credentials:
    aws configure
  3. Update the terraform.tfvars file with your desired configuration parameters. Refer to the comments in the file for guidance on each parameter.
  4. Run the Terraform commands to deploy the infrastructure:
    terraform init
    terraform apply -target=module.networking -target=module.eks -target=module.rds -target=module.vpn -target=module.iam
  5. Once the EKS cluster is up, provide the worker node group role ARN to the Unstructured AI team so that they can add it to the ECR registry policy. This allows the cluster to pull images from the private ECR repository. Then run the following command to apply the remaining resources and bring your infrastructure fully in sync with your configuration:
    terraform apply
  6. If you did not use a custom VPN, download and connect to the VPN:
    1. Download AWS Client VPN from https://aws.amazon.com/vpn/client-vpn-download/.
    2. Open the VPN client and import the downloaded config file: File → Add Profile → select the .ovpn file that was generated in the /modules/vpn module.
    3. Connect to the VPN using the created profile.
  7. Verify the deployment was successful:
    • Access the Unstructured AI application through the provided domain name to ensure you can reach the home page.
    • Use kubectl to ensure all pods in the EKS cluster are running as expected:
      aws eks --region <your-region> update-kubeconfig --name <your-cluster-name>
      kubectl get pods --all-namespaces
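
Two of the steps above can be scripted: looking up the worker node group role ARN (step 5) and spotting pods that are not healthy (step 7). The sketch below is illustrative only. The function names are hypothetical, node_role_arns assumes the cluster uses EKS managed node groups (self-managed nodes would not appear), and unhealthy_pods assumes the default column layout of `kubectl get pods --all-namespaces --no-headers`.

```shell
#!/usr/bin/env bash
# Sketch: helpers for steps 5 and 7 of the deployment.

# node_role_arns CLUSTER REGION: print "<nodegroup><TAB><role ARN>" for each
# managed node group in the cluster.
node_role_arns() {
  for ng in $(aws eks list-nodegroups --cluster-name "$1" --region "$2" \
                --query 'nodegroups[]' --output text); do
    arn="$(aws eks describe-nodegroup --cluster-name "$1" --region "$2" \
             --nodegroup-name "$ng" --query 'nodegroup.nodeRole' --output text)"
    printf '%s\t%s\n' "$ng" "$arn"
  done
}

# unhealthy_pods: read `kubectl get pods --all-namespaces --no-headers` on stdin.
# Columns are NAMESPACE NAME READY STATUS ...; print rows that are not Completed
# and are either not Running or Running with fewer ready containers than expected.
unhealthy_pods() {
  awk '{
    split($3, ready, "/")
    if ($4 == "Completed") next
    if ($4 != "Running" || ready[1] != ready[2]) print
  }'
}

# Usage (replace the placeholders with your own values):
#   node_role_arns <your-cluster-name> <your-region>
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

An empty result from unhealthy_pods suggests that every pod is either running with all containers ready or has completed.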

Optional EKS resources

The following resources are optional and you can include them in your deployment based on your requirements:

  • OpenTelemetry: For monitoring the health and performance of your infrastructure.
  • Cluster Autoscaler: Automatically adjusts the number of nodes in your EKS cluster based on workload demands.
  • linkerd: A service mesh to manage microservices communication in your EKS cluster.
  • cert-manager: Manages TLS certificates in your Kubernetes cluster.
  • linkerd-certs: Creates the necessary certificates for linkerd and includes a cron job to rotate them monthly.

Upgrading Unstructured AI infrastructure

To upgrade your Unstructured AI infrastructure to a newer version:

  1. Download and extract the latest Terraform tarball (.tgz) from the Collibra downloads page.
  2. Review the release notes for any breaking changes or important information regarding the upgrade.
  3. Update your existing terraform.tfvars file with any new parameters or changes introduced in the new version.
  4. Run the following Terraform commands to apply the upgrade:
    terraform init
    terraform apply
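
Because a new version can include breaking changes, it may help to save the plan, inspect it, and then apply exactly that plan instead of running a bare terraform apply. The sketch below shows one way to do this with standard Terraform commands. The function names and the default file name upgrade.tfplan are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: split the upgrade into plan, review, and apply steps.

# plan_upgrade [PLAN_FILE]: initialize the working directory and save a plan.
# Nothing is changed in AWS at this point.
plan_upgrade() {
  plan_file="${1:-upgrade.tfplan}"
  terraform init &&
    terraform plan -out="$plan_file"
}

# review_upgrade [PLAN_FILE]: print the saved plan in human-readable form so you
# can look for resources that would be destroyed or replaced.
review_upgrade() {
  terraform show "${1:-upgrade.tfplan}"
}

# apply_upgrade [PLAN_FILE]: apply the saved plan without re-planning.
apply_upgrade() {
  terraform apply "${1:-upgrade.tfplan}"
}

# Usage:
#   plan_upgrade && review_upgrade
#   apply_upgrade
```

Applying a saved plan means Terraform performs only the actions you reviewed. If the state changes in the meantime, Terraform rejects the stale plan and you need to plan again.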

Update Helm installation with new parameters

To update the Helm installation with new parameters:

  1. Update your existing terraform.tfvars file with the new parameter values.
  2. Run the following command to apply the changes:
    terraform apply

Troubleshooting Helm provider OCI registry authentication errors

A known bug in Helm provider version 3.x causes the repository_password to be cached in Terraform state. When an ECR authorization token expires after 12 hours, Terraform continues to use the expired token from state instead of fetching a fresh token from the data.aws_ecr_authorization_token data source.

The error message is:

Error: OCI Registry Login Failed
Failed to log in to OCI registry "oci://...": response status code 403: denied: Your authorization token has expired. Reauthenticate and try again.

Steps to resolve

The workaround involves removing the affected Helm releases from Terraform state and re-importing them. This clears the cached expired credentials and allows Terraform to use fresh authentication tokens.

Remove and re-import all Helm releases that pull from ECR:

terraform state rm module.frontend.helm_release.frontend
terraform import module.frontend.helm_release.frontend unstructured/unstructured-frontend

terraform state rm module.backend.helm_release.backend
terraform import module.backend.helm_release.backend unstructured/unstructured-backend

terraform state rm module.linkerd_certs.helm_release.linkerd_certs
terraform import module.linkerd_certs.helm_release.linkerd_certs linkerd/linkerd-certs

terraform state rm module.linkerd_certs.helm_release.cert_manager
terraform import module.linkerd_certs.helm_release.cert_manager cert-manager/cert-manager

terraform state rm module.linkerd_crds.helm_release.linkerd_crds
terraform import module.linkerd_crds.helm_release.linkerd_crds linkerd/linkerd-crds

terraform state rm module.linkerd.helm_release.linkerd
terraform import module.linkerd.helm_release.linkerd linkerd/linkerd

terraform state rm module.ingress.helm_release.aws_load_balancer_controller
terraform import module.ingress.helm_release.aws_load_balancer_controller kube-system/aws-load-balancer-controller

terraform state rm module.eks_workload_addons.helm_release.argo_events
terraform import module.eks_workload_addons.helm_release.argo_events unstructured/argo-events

terraform state rm module.eks_workload_addons.helm_release.argo_workflows
terraform import module.eks_workload_addons.helm_release.argo_workflows unstructured/argo-workflows

terraform state rm module.eks_workload_addons.helm_release.external_secrets_operator
terraform import module.eks_workload_addons.helm_release.external_secrets_operator unstructured/external-secrets

terraform state rm module.eks_workload_addons.helm_release.otel_collector
terraform import module.eks_workload_addons.helm_release.otel_collector unstructured/otel-collector

terraform state rm module.eks_workload_addons.helm_release.aws_ebs_csi_driver
terraform import module.eks_workload_addons.helm_release.aws_ebs_csi_driver kube-system/aws-ebs-csi-driver

terraform state rm module.eks_workload_addons.helm_release.cluster-autoscaler
terraform import module.eks_workload_addons.helm_release.cluster-autoscaler kube-system/cluster-autoscaler

After re-importing the releases, apply the configuration:

terraform apply
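
The remove and re-import pairs above follow one pattern, so they can be driven from a list. The following is a minimal sketch with hypothetical function names. It stops at the first failure, because a release that has been removed from state but not re-imported would otherwise be treated as a new resource on the next apply.

```shell
#!/usr/bin/env bash
# Sketch: remove and re-import Helm releases from a list of "ADDRESS ID" pairs.

# reimport_release ADDRESS ID: drop one helm_release from state, then import it.
# stdin is redirected so Terraform cannot consume the caller's input.
reimport_release() {
  terraform state rm "$1" </dev/null || return 1
  terraform import "$1" "$2" </dev/null
}

# reimport_all: read one "ADDRESS ID" pair per line from stdin.
# Blank lines and lines starting with # are skipped.
reimport_all() {
  while read -r address id || [ -n "$address" ]; do
    case "$address" in
      '' | '#'*) continue ;;
    esac
    reimport_release "$address" "$id" || {
      echo "failed on $address; resolve this before continuing" >&2
      return 1
    }
  done
}

# Usage (pairs taken from the commands above):
#   reimport_all <<'EOF'
#   module.frontend.helm_release.frontend unstructured/unstructured-frontend
#   module.backend.helm_release.backend unstructured/unstructured-backend
#   EOF
```

List only the releases that exist in your state (terraform state list shows them), because terraform state rm fails for addresses that are not present, for example when an optional resource was not deployed.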

When this occurs

  • After ECR authorization tokens expire (tokens are valid for 12 hours).
  • After extended periods between Terraform applies.
  • When switching AWS profiles or credentials.
