Commit c071367

feat(glue): support partition index on tables (#17998)
This PR adds support for creating partition indexes on tables via custom resources. It offers two different ways to create indexes:

```ts
// via table definition
const table = new glue.Table(this, 'Table', {
  database,
  bucket,
  tableName: 'table',
  columns,
  partitionKeys,
  partitionIndexes: [{
    indexName: 'my-index',
    keyNames: ['month'],
  }],
  dataFormat: glue.DataFormat.CSV,
});
```

```ts
// or as a function
table.addPartitionIndex({
  indexName: 'my-other-index',
  keyNames: ['month', 'year'],
});
```

I also refactored the format of some tests, which is what accounts for the large diff in `test.table.ts`.

Motivation:

Creating partition indexes on a table is something you can do via the console, but it is not an exposed property in CloudFormation. In this case, I think it makes sense to support this feature via custom resources, as it will significantly reduce the customer pain of either provisioning a custom resource with correct permissions or manually going into the console after resource creation. Supporting this feature allows for synth-time checks and dependency chaining for multiple indexes (reason detailed in the FAQ), which removes a rather sharp edge for users provisioning custom resource indexes themselves.

FAQ:

Why do we need to chain dependencies between different Partition Index Custom Resources?
- Because Glue only allows one index to be created or deleted at a time per table. Without dependencies, the resources will try to create partition indexes simultaneously and the second SDK call will be dropped.

Why is it called `partitionIndexes`? Is that really how you pluralize index?
- [Yesish](https://www.nasdaq.com/articles/indexes-or-indices-whats-the-deal-2016-05-12). If you hate it, it can be `partitionIndices`.

Why is `keyNames` of type `string[]` and not `Column[]`? `PartitionKey` is of type `Column[]` and partition indexes must be a subset of partition keys...
- This could be a debate. But my argument is that the pattern I see for defining a Table is to define partition keys inline and not declare them each as variables. It would be pretty clunky from a UX perspective:

```ts
const key1 = { name: 'mykey', type: glue.Schema.STRING };
const key2 = { name: 'mykey2', type: glue.Schema.STRING };
const key3 = { name: 'mykey3', type: glue.Schema.STRING };
new glue.Table(this, 'table', {
  database,
  bucket,
  tableName: 'table',
  columns,
  partitionKeys: [key1, key2, key3],
  partitionIndexes: [key1, key2],
  dataFormat: glue.DataFormat.CSV,
});
```

Why are there 2 different checks for having > 3 partition indexes?
- It's possible someone decides to define 3 indexes in the definition and then tries to add another with `table.addPartitionIndex()`. This would be a nasty deploy-time error; it's better if it fails at synth time. It's also possible someone decides to define 4 indexes in the definition. It's better to fail fast here before we create 3 custom resources.

What if I deploy a table, manually add 3 partition indexes, and then try to call `table.addPartitionIndex()` and update the stack? Will that still be a synth-time failure?
- Sorry, no.

Why do we need to generate names?
- We don't. I just thought it would be helpful.

Why is `grantToUnderlyingResources` public?
- I thought it would be helpful. Some permissions need to be added to the table, the database, and the catalog.

Closes #17589.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
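The dependency chaining described in the FAQ can be illustrated with a toy model. This is only a sketch: `ResourceNode`, `chain`, and `deploymentWaves` are hypothetical names that do not exist in the CDK, and the scheduler below merely imitates the idea that resources without mutual dependencies may be deployed at the same time.

```ts
// Toy model (not CDK code): each node lists the ids it must wait for.
interface ResourceNode {
  id: string;
  dependsOn: string[];
}

// Chain the resources so that each one waits for the one before it.
function chain(ids: string[]): ResourceNode[] {
  return ids.map((id, i) => ({ id, dependsOn: i === 0 ? [] : [ids[i - 1]] }));
}

// Group nodes into "waves"; everything inside one wave could run in parallel.
function deploymentWaves(nodes: ResourceNode[]): string[][] {
  const done = new Set<string>();
  const waves: string[][] = [];
  let remaining = [...nodes];
  while (remaining.length > 0) {
    const ready = remaining.filter((n) => n.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error('dependency cycle');
    }
    waves.push(ready.map((n) => n.id));
    ready.forEach((n) => done.add(n.id));
    remaining = remaining.filter((n) => !done.has(n.id));
  }
  return waves;
}
```

With three unchained nodes, `deploymentWaves` returns a single wave containing all three, which corresponds to the simultaneous calls the FAQ warns about. With `chain(['a', 'b', 'c'])`, it returns three waves of one node each, so only one index operation is in flight at a time.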
1 parent aa51b6c commit c071367

7 files changed, +1379 -420 lines changed

packages/@aws-cdk/aws-glue/README.md

+47 -1
@@ -194,7 +194,7 @@ new glue.Table(this, 'MyTable', {
 
 By default, an S3 bucket will be created to store the table's data and stored in the bucket root. You can also manually pass the `bucket` and `s3Prefix`:
 
-### Partitions
+### Partition Keys
 
 To improve query performance, a table can specify `partitionKeys` on which data is stored and queried separately. For example, you might partition a table by `year` and `month` to optimize queries based on a time window:
 
@@ -218,6 +218,52 @@ new glue.Table(this, 'MyTable', {
 });
 ```
 
+### Partition Indexes
+
+Another way to improve query performance is to specify partition indexes. If no partition indexes are
+present on the table, AWS Glue loads all partitions of the table and filters the loaded partitions using
+the query expression. The query takes more time to run as the number of partitions increase. With an
+index, the query will try to fetch a subset of the partitions instead of loading all partitions of the
+table.
+
+The keys of a partition index must be a subset of the partition keys of the table. You can have a
+maximum of 3 partition indexes per table. To specify a partition index, you can use the `partitionIndexes`
+property:
+
+```ts
+declare const myDatabase: glue.Database;
+new glue.Table(this, 'MyTable', {
+  database: myDatabase,
+  tableName: 'my_table',
+  columns: [{
+    name: 'col1',
+    type: glue.Schema.STRING,
+  }],
+  partitionKeys: [{
+    name: 'year',
+    type: glue.Schema.SMALL_INT,
+  }, {
+    name: 'month',
+    type: glue.Schema.SMALL_INT,
+  }],
+  partitionIndexes: [{
+    indexName: 'my-index', // optional
+    keyNames: ['year'],
+  }], // supply up to 3 indexes
+  dataFormat: glue.DataFormat.JSON,
+});
+```
+
+Alternatively, you can call the `addPartitionIndex()` function on a table:
+
+```ts
+declare const myTable: glue.Table;
+myTable.addPartitionIndex({
+  indexName: 'my-index',
+  keyNames: ['year'],
+});
+```
+
 ## [Encryption](https://docs.aws.amazon.com/athena/latest/ug/encryption.html)
 
 You can enable encryption on a Table's data:
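
The two rules the README change states for partition indexes (index keys must be a subset of the table's partition keys, and a table can have at most 3 indexes) can be checked before anything is deployed. The following is a minimal standalone sketch of such a check. `Col`, `IndexSpec`, and `checkIndexes` are hypothetical names used for illustration and are not part of `@aws-cdk/aws-glue`.

```ts
// Hypothetical stand-ins for the library's column and index shapes.
interface Col {
  name: string;
  type: string;
}

interface IndexSpec {
  indexName?: string;
  keyNames: string[];
}

const MAX_INDEXES = 3; // limit stated in the README text

// Returns a list of problems; an empty list means the indexes look valid.
function checkIndexes(partitionKeys: Col[], indexes: IndexSpec[]): string[] {
  const errors: string[] = [];
  if (indexes.length > MAX_INDEXES) {
    errors.push(`Maximum number of partition indexes allowed is ${MAX_INDEXES}`);
  }
  const allowed = new Set(partitionKeys.map((pk) => pk.name));
  for (const index of indexes) {
    // Every index key must name one of the table's partition keys.
    const unknown = index.keyNames.filter((k) => !allowed.has(k));
    if (unknown.length > 0) {
      errors.push(`Index keys not in partition keys: ${unknown.join(', ')}`);
    }
  }
  return errors;
}
```

For a table partitioned by `year` and `month`, an index on `['year']` passes, while an index on `['day']` is reported because `day` is not a partition key.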

packages/@aws-cdk/aws-glue/lib/table.ts

+130 -2
@@ -1,13 +1,33 @@
 import * as iam from '@aws-cdk/aws-iam';
 import * as kms from '@aws-cdk/aws-kms';
 import * as s3 from '@aws-cdk/aws-s3';
-import { ArnFormat, Fn, IResource, Resource, Stack } from '@aws-cdk/core';
+import { ArnFormat, Fn, IResource, Names, Resource, Stack } from '@aws-cdk/core';
+import * as cr from '@aws-cdk/custom-resources';
+import { AwsCustomResource } from '@aws-cdk/custom-resources';
 import { Construct } from 'constructs';
 import { DataFormat } from './data-format';
 import { IDatabase } from './database';
 import { CfnTable } from './glue.generated';
 import { Column } from './schema';
 
+/**
+ * Properties of a Partition Index.
+ */
+export interface PartitionIndex {
+  /**
+   * The name of the partition index.
+   *
+   * @default - a name will be generated for you.
+   */
+  readonly indexName?: string;
+
+  /**
+   * The partition key names that comprise the partition
+   * index. The names must correspond to a name in the
+   * table's partition keys.
+   */
+  readonly keyNames: string[];
+}
 export interface ITable extends IResource {
   /**
    * @attribute
@@ -102,7 +122,16 @@ export interface TableProps {
    *
    * @default table is not partitioned
    */
-  readonly partitionKeys?: Column[]
+  readonly partitionKeys?: Column[];
+
+  /**
+   * Partition indexes on the table. A maximum of 3 indexes
+   * are allowed on a table. Keys in the index must be part
+   * of the table's partition keys.
+   *
+   * @default table has no partition indexes
+   */
+  readonly partitionIndexes?: PartitionIndex[];
 
   /**
    * Storage type of the table's data.
@@ -230,6 +259,18 @@ export class Table extends Resource implements ITable {
    */
   public readonly partitionKeys?: Column[];
 
+  /**
+   * This table's partition indexes.
+   */
+  public readonly partitionIndexes?: PartitionIndex[];
+
+  /**
+   * Partition indexes must be created one at a time. To avoid
+   * race conditions, we store the resource and add dependencies
+   * each time a new partition index is created.
+   */
+  private partitionIndexCustomResources: AwsCustomResource[] = [];
+
   constructor(scope: Construct, id: string, props: TableProps) {
     super(scope, id, {
       physicalName: props.tableName,
@@ -287,6 +328,77 @@ export class Table extends Resource implements ITable {
       resourceName: `${this.database.databaseName}/${this.tableName}`,
     });
     this.node.defaultChild = tableResource;
+
+    // Partition index creation relies on created table.
+    if (props.partitionIndexes) {
+      this.partitionIndexes = props.partitionIndexes;
+      this.partitionIndexes.forEach((index) => this.addPartitionIndex(index));
+    }
+  }
+
+  /**
+   * Add a partition index to the table. You can have a maximum of 3 partition
+   * indexes to a table. Partition index keys must be a subset of the table's
+   * partition keys.
+   *
+   * @see https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html
+   */
+  public addPartitionIndex(index: PartitionIndex) {
+    const numPartitions = this.partitionIndexCustomResources.length;
+    if (numPartitions >= 3) {
+      throw new Error('Maximum number of partition indexes allowed is 3');
+    }
+    this.validatePartitionIndex(index);
+
+    const indexName = index.indexName ?? this.generateIndexName(index.keyNames);
+    const partitionIndexCustomResource = new cr.AwsCustomResource(this, `partition-index-${indexName}`, {
+      onCreate: {
+        service: 'Glue',
+        action: 'createPartitionIndex',
+        parameters: {
+          DatabaseName: this.database.databaseName,
+          TableName: this.tableName,
+          PartitionIndex: {
+            IndexName: indexName,
+            Keys: index.keyNames,
+          },
+        },
+        physicalResourceId: cr.PhysicalResourceId.of(
+          indexName,
+        ),
+      },
+      policy: cr.AwsCustomResourcePolicy.fromSdkCalls({
+        resources: cr.AwsCustomResourcePolicy.ANY_RESOURCE,
+      }),
+    });
+    this.grantToUnderlyingResources(partitionIndexCustomResource, ['glue:UpdateTable']);
+
+    // Depend on previous partition index if possible, to avoid race condition
+    if (numPartitions > 0) {
+      this.partitionIndexCustomResources[numPartitions-1].node.addDependency(partitionIndexCustomResource);
+    }
+    this.partitionIndexCustomResources.push(partitionIndexCustomResource);
+  }
+
+  private generateIndexName(keys: string[]): string {
+    const prefix = keys.join('-') + '-';
+    const uniqueId = Names.uniqueId(this);
+    const maxIndexLength = 80; // arbitrarily specified
+    const startIndex = Math.max(0, uniqueId.length - (maxIndexLength - prefix.length));
+    return prefix + uniqueId.substring(startIndex);
+  }
+
+  private validatePartitionIndex(index: PartitionIndex) {
+    if (index.indexName !== undefined && (index.indexName.length < 1 || index.indexName.length > 255)) {
+      throw new Error(`Index name must be between 1 and 255 characters, but got ${index.indexName.length}`);
+    }
+    if (!this.partitionKeys || this.partitionKeys.length === 0) {
+      throw new Error('The table must have partition keys to create a partition index');
+    }
+    const keyNames = this.partitionKeys.map(pk => pk.name);
+    if (!index.keyNames.every(k => keyNames.includes(k))) {
+      throw new Error(`All index keys must also be partition keys. Got ${index.keyNames} but partition key names are ${keyNames}`);
+    }
   }
 
   /**
@@ -336,6 +448,22 @@ export class Table extends Resource implements ITable {
     });
   }
 
+  /**
+   * Grant the given identity custom permissions to ALL underlying resources of the table.
+   * Permissions will be granted to the catalog, the database, and the table.
+   */
+  public grantToUnderlyingResources(grantee: iam.IGrantable, actions: string[]) {
+    return iam.Grant.addToPrincipal({
+      grantee,
+      resourceArns: [
+        this.tableArn,
+        this.database.catalogArn,
+        this.database.databaseArn,
+      ],
+      actions,
+    });
+  }
+
   private getS3PrefixForGrant() {
     return this.s3Prefix + '*';
   }
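
`generateIndexName` in this diff keeps the key-name prefix intact and trims the construct's unique ID from the left so that the result fits within 80 characters. The sketch below reproduces that arithmetic as a standalone function. It assumes the unique ID is passed in as a plain string (in the diff it comes from `Names.uniqueId(this)`), and the name `indexNameFor` is hypothetical.

```ts
// Standalone sketch of the name-generation arithmetic; not the library's method.
function indexNameFor(keys: string[], uniqueId: string, maxLength = 80): string {
  // The prefix is the index key names joined with '-', plus a trailing '-'.
  const prefix = keys.join('-') + '-';
  // Keep only as many trailing characters of the unique ID as still fit.
  const start = Math.max(0, uniqueId.length - (maxLength - prefix.length));
  return prefix + uniqueId.substring(start);
}
```

A short ID is appended whole, so `indexNameFor(['year', 'month'], 'abc')` yields `year-month-abc`. A 100-character ID combined with the key `year` is cut down to its last 75 characters, giving a name of exactly 80 characters. Note that if the prefix alone is longer than `maxLength`, the ID is dropped entirely and the result is just the prefix, which then exceeds the limit.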

packages/@aws-cdk/aws-glue/package.json

+2
@@ -99,6 +99,7 @@
     "@aws-cdk/aws-s3": "0.0.0",
     "@aws-cdk/aws-s3-assets": "0.0.0",
     "@aws-cdk/core": "0.0.0",
+    "@aws-cdk/custom-resources": "0.0.0",
     "constructs": "^3.3.69"
   },
   "homepage": "https://github.com/aws/aws-cdk",
@@ -113,6 +114,7 @@
     "@aws-cdk/aws-s3": "0.0.0",
     "@aws-cdk/aws-s3-assets": "0.0.0",
     "@aws-cdk/core": "0.0.0",
+    "@aws-cdk/custom-resources": "0.0.0",
     "constructs": "^3.3.69"
   },
   "engines": {
