DQ Job S3
S3 permissions must be set up appropriately before running a DQ job against S3.
Note: S3 connections should be defined at the root bucket level, for example s3a://my-bucket rather than s3a://my-bucket/nested/path. Nested S3 connections are not supported.
Example Minimum Permissions
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucketMultipartUploads",
        "s3:ListBucket",
        "s3:ListMultipartUploadParts",
        "s3:GetObject",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:athena:*:<AWSAccountID>:workgroup/primary",
        "arn:aws:s3:::<S3 bucket name>/*",
        "arn:aws:s3:::<S3 bucket name>",
        "arn:aws:glue:*:<AWSAccountID>:catalog",
        "arn:aws:glue:*:<AWSAccountID>:database/<database name>",
        "arn:aws:glue:*:<AWSAccountID>:table/<database name>/*"
      ]
    }
  ]
}
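Once the policy is attached, a quick way to confirm the permissions from a Spark environment is to list the bucket root with the Hadoop FileSystem API. This is a minimal sketch: the bucket placeholder matches the policy above, and it assumes hadoop-aws is on the classpath and credentials are already configured.
// minimal sketch: list the bucket root to confirm s3:ListBucket works
// assumes hadoop-aws is on the classpath and credentials are configured
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val bucket = "s3a://<S3 bucket name>"  // placeholder bucket from the policy above
val fs = FileSystem.get(new URI(bucket), spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(bucket + "/")).foreach(status => println(status.getPath))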
The DQ job needs the Hadoop AWS driver on its classpath, for example hadoop-aws-2.7.3.2.6.5.0-292.jar, available from http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/.
-f "s3a://s3-location/testfile.csv" \
-d "," \
-rd "2018-01-08" \
-ds "salary_data_s3" \
-deploymode client \
-lib /home/ec2-user/owl/drivers/aws/
Databricks Utils or Spark Conf
val AccessKey = "xxx"
val SecretKey = "xxxyyyzzz"
//val EncodedSecretKey = SecretKey.replace("/", "%2F")
val AwsBucketName = "s3-location"
val MountName = "kirk"
dbutils.fs.unmount(s"/mnt/$MountName")
dbutils.fs.mount(s"s3a://${AccessKey}:${SecretKey}@${AwsBucketName}", s"/mnt/$MountName")
//display(dbutils.fs.ls(s"/mnt/$MountName"))
//sse-s3 example
dbutils.fs.mount(s"s3a://$AccessKey:$SecretKey@$AwsBucketName", s"/mnt/$MountName", "sse-s3")
Databricks Notebooks using S3 buckets
val AccessKey = "ABCDED"
val SecretKey = "aaasdfwerwerasdfB"
val EncodedSecretKey = SecretKey.replace("/", "%2F")
val AwsBucketName = "s3-location"
val MountName = "abc"
// bug if you don't unmount first
dbutils.fs.unmount(s"/mnt/$MountName")
// mount the s3 bucket
dbutils.fs.mount(s"s3a://${AccessKey}:${EncodedSecretKey}@${AwsBucketName}", s"/mnt/$MountName")
display(dbutils.fs.ls(s"/mnt/$MountName"))
// read the dataframe
val df = spark.read.text(s"/mnt/$MountName/atm_customer/atm_customer_2019_01_28.csv")
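Since the file is a CSV, it can also be parsed into typed columns rather than read as raw lines. A short sketch; the header and inferSchema options are assumptions about the file's layout.
// parse the same file into columns instead of raw lines
// header/inferSchema are assumptions about the file contents
val parsed = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(s"/mnt/$MountName/atm_customer/atm_customer_2019_01_28.csv")
parsed.printSchema()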