Update a Purview Resource Set's Tabular Schema

A tabular schema entity has an associated DataSet and a set of columns. Each column has a composeSchema relationship that points to a single tabular schema.

Azure Purview's connectors enable you to scan many different data sources, and some of those sources have a schema. For file-based data sources (e.g. a storage account in Azure or AWS), a TabularSchema is attached to the ResourceSet that represents the collection of files that were scanned. In addition, this tabular schema object can be attached to any data set to give it a schema.

However, this often causes angst, since you would normally define a custom type in Purview / Atlas and use the Purview type options to control the schema.

Let's dive into how to extract columns, update columns, and add columns to a tabular schema.

Get Columns from a Tabular Schema
Update a Column in a Resource Set / Tabular Schema
Adding a Column to a Resource Set / Tabular Schema
Recap

Get Columns from a Tabular Schema

Assuming you've installed PyApacheAtlas to query Purview, we can create a client object to query a given asset.

You'll need either the qualified name and type, or the guid. You can get the guid by extracting it from the browser URL of the asset you want to pull down. Qualified name can be extracted from the

# After generating the PurviewClient (named client)
# we can get our entity by guid or qualifiedname/typeName
some_table_entity = client.get_entity(guid="abc-123-def")["entities"][0]

# Now we need the tabular schema guid to be able to query for its columns
tabular_schema_guid = some_table_entity["relationshipAttributes"]["tabular_schema"]["guid"]
tabular_schema_entity = client.get_entity(guid=tabular_schema_guid)["entities"][0]

# With the tabular schema entity pulled, we now have a reference for
# each of the columns in the tabular schema
for column in tabular_schema_entity["relationshipAttributes"]["columns"]:
    print(column)

The script above performs the following steps:

After creating a client object that can connect to your Purview instance...
Query a specific guid using client.get_entity and extract the first entity from the results.
Once you've got the entity, you can grab the tabular_schema relationship attribute. This is a pointer to the tabular_schema object (assuming it's defined).
We then grab the guid of the tabular_schema object and query THAT guid.
The tabular_schema object is extracted into the tabular_schema_entity variable.
Finally, we can iterate over all of the columns and print them, or print specific attributes of each one.
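The steps above can be wrapped into a small helper. This is a minimal sketch, not part of PyApacheAtlas: the function name get_tabular_schema_columns is hypothetical, and it only assumes that client.get_entity(guid=...) returns a dictionary with an "entities" list, as in the script above. It returns an empty list when the asset has no tabular_schema relationship.

```python
def get_tabular_schema_columns(client, asset_guid):
    """Return the column references of an asset's tabular schema.

    Hypothetical helper: `client` is any object whose get_entity(guid=...)
    returns {"entities": [...]}, such as the PurviewClient used above.
    """
    asset = client.get_entity(guid=asset_guid)["entities"][0]

    # The tabular_schema relationship may be missing or empty if the
    # asset was never given a schema.
    schema_ref = asset.get("relationshipAttributes", {}).get("tabular_schema")
    if not schema_ref:
        return []

    schema = client.get_entity(guid=schema_ref["guid"])["entities"][0]
    return schema.get("relationshipAttributes", {}).get("columns", [])
```

Calling get_tabular_schema_columns(client, "abc-123-def") would then give you the same list that the loop above prints, without raising a KeyError when no schema is attached.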

Update a Column in a Resource Set / Tabular Schema

Assuming you're updating a couple of column descriptions, you can use PurviewClient.partial_update_entity to update each one in turn. The easiest way to do this is to grab the qualified name of the column or columns you want to update.

The type of the column is always "column" which is convenient!

# Create a dictionary that has a key of the qualified name
# of the column and a value that is a dictionary with one
# or many keys that represent the attribute name to update
# and the values represent the updated values you want to
# apply
column_updates = {
    "qualifiedName1": {"description": "my new description"},
    "qualifiedName2": {"description": "my other new description"},
}

for column_qn, attributes in column_updates.items():
    # Assuming you've already created the PurviewClient
    client.partial_update_entity(
        qualifiedName=column_qn,
        typeName="column",
        attributes=attributes
    )
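If you don't have the qualified names to hand, one way to collect them is to look up each column referenced by the tabular schema. The sketch below assumes that the entries under the columns relationship attribute carry a guid, that get_entity accepts a list of guids, and that each returned column has name and qualifiedName attributes. The function name column_qualified_names is hypothetical.

```python
def column_qualified_names(client, tabular_schema_entity):
    """Map each column's name to its qualified name.

    Hypothetical helper; assumes client.get_entity accepts a list of
    guids and returns {"entities": [...]}.
    """
    relationships = tabular_schema_entity.get("relationshipAttributes", {})
    guids = [ref["guid"] for ref in relationships.get("columns", [])]
    if not guids:
        return {}

    columns = client.get_entity(guid=guids)["entities"]
    return {
        col["attributes"]["name"]: col["attributes"]["qualifiedName"]
        for col in columns
    }
```

The qualified names in the returned dictionary can then be used as the keys of column_updates.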

Adding a Column to a Resource Set / Tabular Schema

Perhaps the scanning didn't quite catch all of the columns of your data?

You can add a column to an existing tabular schema, but you first need to follow the same steps as above to get the tabular schema guid and the tabular schema entity.

from pyapacheatlas.core import AtlasEntity

# After generating the PurviewClient (named client)
# we can get our entity by guid or qualifiedname/typeName
some_table_entity = client.get_entity(guid="abc-123-def")["entities"][0]

# Now we need the tabular schema guid (for the relationship) and the
# tabular schema entity (for its qualified name)
tabular_schema_guid = some_table_entity["relationshipAttributes"]["tabular_schema"]["guid"]
tabular_schema_entity = client.get_entity(guid=tabular_schema_guid)["entities"][0]

# Now we have to create an Atlas Entity to represent our additional column
column = AtlasEntity(
    "my_custom_column",
    "column",
    qualified_name = tabular_schema_entity["attributes"]["qualifiedName"] + "#my_custom_column",
    guid="-1",
    attributes = {
        "type":"string",
        "description": "This is my column added via the API"
    }
)
column.addRelationship(composeSchema={"guid":tabular_schema_guid})

results = client.upload_entities([column])

print(results)

In this script, we need the guid (or the qualified name and type) of the table resource that "contains" the schema. We use that entity to find its tabular schema. With the tabular schema guid, we can add the schema as a composeSchema relationship on the new column entity that we create with the AtlasEntity class.
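To add several columns at once, the same pattern can be repeated with a distinct placeholder guid per column. The sketch below builds plain dictionaries in the Atlas entity JSON shape instead of AtlasEntity objects so that it runs without any imports. build_column_payloads is a hypothetical name, and whether your version of client.upload_entities accepts such dictionaries directly is an assumption to verify.

```python
def build_column_payloads(schema_guid, schema_qualified_name, columns):
    """Build Atlas-style entity dictionaries for several new columns.

    `columns` maps a column name to its data type, e.g. {"age": "int"}.
    Each payload gets its own negative placeholder guid (-1, -2, ...).
    """
    payloads = []
    for offset, (name, data_type) in enumerate(columns.items(), start=1):
        payloads.append({
            "typeName": "column",
            "guid": str(-offset),
            "attributes": {
                "name": name,
                "qualifiedName": f"{schema_qualified_name}#{name}",
                "type": data_type,
            },
            # Same relationship the single-column script sets up
            "relationshipAttributes": {"composeSchema": {"guid": schema_guid}},
        })
    return payloads
```

The qualified name follows the same "schema qualified name, then #, then column name" convention used in the script above.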

Then we upload the new column (or columns) using client.upload_entities and pass in a list of AtlasEntities. You've now created a new column and added the connection (relationship) to the tabular schema.
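The results dictionary is worth inspecting rather than only printing. As a sketch, and assuming the response follows the usual Atlas bulk-entity shape with a guidAssignments mapping from placeholder guids (such as "-1") to the real guids, a hypothetical helper like assigned_guid can recover the new column's guid:

```python
def assigned_guid(results, placeholder_guid="-1"):
    """Return the real guid assigned to a placeholder guid, or None.

    Assumes `results` looks like the Atlas bulk-entity response, e.g.
    {"guidAssignments": {"-1": "<real guid>"}, ...}.
    """
    return results.get("guidAssignments", {}).get(placeholder_guid)


# Example with a hand-written response (the guid value is made up)
sample_results = {"guidAssignments": {"-1": "0000-made-up-guid"}}
print(assigned_guid(sample_results))
```

A return value of None suggests the placeholder was not part of the upload, which is a quick signal that something did not go as expected.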

Recap

Working with ResourceSets requires you to understand the tabular schema object attached to the resource sets. To update, add, or extract the columns from a resource set, you have to...

Call PurviewClient.get_entity on your resource set to find the relationship to the tabular_schema object.
Use the tabular_schema object's guid in a column entity's composeSchema relationship attribute when adding a column.
Use the tabular_schema's columns relationship attribute to extract all of the column entities.