Azure Purview's connectors enable you to scan many different data sources, and some of the scanned assets have a schema. For file-based data sources (e.g. a storage account in Azure or AWS), a TabularSchema is attached to the ResourceSet that represents the collection of files that were scanned. In addition, this tabular schema object can be attached to any data set to give it a schema.
However, this often causes angst since normally you can define a custom type in Purview / Atlas and use the Purview type options to control the schema.
Let's dive into how to extract columns, update columns, and add columns to a tabular schema.
Assuming you've installed PyApacheAtlas to query Purview, we can create a client object to query a given asset.
You'll need either the qualified name and type or guid. You can get the guid by extracting it from the browser of the asset you want to pull down. Qualified name can be extracted from the
# After generating the PurviewClient (named client)
# we can get our entity by guid or qualifiedname/typeName
some_table_entity = client.get_entity(guid="abc-123-def")["entities"][0]
# Now we need the tabular schema guid to be able to query for its columns
tabular_schema_guid = some_table_entity["relationshipAttributes"]["tabular_schema"]["guid"]
tabular_schema_entity = client.get_entity(tabular_schema_guid)["entities"][0]
# With the tabular schema entity pulled, we now have a reference for
# each of the columns in the tabular schema
for column in tabular_schema_entity["relationshipAttributes"]["columns"]:
    print(column)
The script above performs the following steps:
1. Call client.get_entity and extract the first entity from the results.
2. Read the tabular_schema relationship attribute. This is a pointer to the tabular_schema object (assuming it's defined).
3. Call client.get_entity again with the tabular schema guid and loop over its columns relationship attribute.
Assuming you're updating a couple of column descriptions, you can use PurviewClient.partial_update_entity to iteratively update each one. The easiest way to do this is to grab the qualified name of the column or columns you want to update.
The type of the column is always "column" which is convenient!
# Create a dictionary that has a key of the qualified name
# of the column and a value that is a dictionary with one
# or many keys that represent the attribute name to update
# and the values represent the updated values you want to
# apply
column_updates = {
    "qualifiedName1": {"description": "my new description"},
    "qualifiedName2": {"description": "my other new description"},
}
for column_qn in column_updates:
    # Assuming you've already created the PurviewClient
    client.partial_update_entity(
        qualifiedName=column_qn,
        typeName="column",
        attributes=column_updates[column_qn]
    )
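If you have many columns to update, it can help to keep going when one update fails and report the failures at the end. The following is a minimal sketch under assumptions: apply_column_updates is a hypothetical helper (not part of PyApacheAtlas), and it only assumes that the client exposes partial_update_entity with the keyword arguments used above.

```python
def apply_column_updates(client, column_updates, type_name="column"):
    """Apply each update and collect successes and failures.

    Hypothetical helper, not part of PyApacheAtlas. `client` is any
    object with a partial_update_entity(qualifiedName=..., typeName=...,
    attributes=...) method, such as the PurviewClient used above.
    """
    succeeded = {}
    failed = {}
    for column_qn, attributes in column_updates.items():
        try:
            succeeded[column_qn] = client.partial_update_entity(
                qualifiedName=column_qn,
                typeName=type_name,
                attributes=attributes,
            )
        except Exception as err:
            # Record the error and continue so that one bad qualified
            # name does not stop the remaining updates.
            failed[column_qn] = str(err)
    return succeeded, failed
```

Calling apply_column_updates(client, column_updates) returns two dictionaries keyed by qualified name: the responses for updates that went through and the error messages for those that did not.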
Perhaps the scanning didn't quite catch all of the columns of your data?
You can add a column to an existing tabular schema, but you first need the tabular schema guid, which you can get by following the same steps as above.
# AtlasEntity is needed to build the new column
from pyapacheatlas.core import AtlasEntity
# After generating the PurviewClient (named client)
# we can get our entity by guid or qualifiedname/typeName
some_table_entity = client.get_entity(guid="abc-123-def")["entities"][0]
# Now we need the tabular schema guid, plus the tabular schema entity itself,
# because the new column's qualified name is built from the schema's qualified name
tabular_schema_guid = some_table_entity["relationshipAttributes"]["tabular_schema"]["guid"]
tabular_schema_entity = client.get_entity(guid=tabular_schema_guid)["entities"][0]
# Now we have to create an Atlas Entity to represent our additional column
column = AtlasEntity(
    "my_custom_column",
    "column",
    qualified_name=tabular_schema_entity["attributes"]["qualifiedName"] + "#my_custom_column",
    guid="-1",
    attributes={
        "type": "string",
        "description": "This is my column added via the API"
    }
)
column.addRelationship(composeSchema={"guid": tabular_schema_guid})
results = client.upload_entities([column])
print(results)
In this script, we need the guid or qualified name and type of the table resource that "contains" the schema. We use that entity to find its tabular schema. With the tabular schema guid, we can add it as a relationship to the new column entity we create with the AtlasEntity class. Then we upload the new column (or columns) using client.upload_entities and pass in a list of AtlasEntities. You've now created a new column and added the connection (relationship) to the tabular schema.
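Before uploading, it may be worth checking that the column is not already part of the schema. This is a rough sketch, assuming that each entry in the tabular schema's columns relationship carries a displayText holding the column name (worth verifying against your own response) and that column qualified names follow the "schema qualified name + # + column name" pattern used in the script. The names build_column_qualified_name and column_exists are hypothetical helpers, not part of PyApacheAtlas.

```python
def build_column_qualified_name(tabular_schema_entity, column_name):
    """Build a column qualified name from the schema's qualified name.

    Assumes the '<schema qualified name>#<column name>' pattern.
    """
    schema_qn = tabular_schema_entity["attributes"]["qualifiedName"]
    return schema_qn + "#" + column_name


def column_exists(tabular_schema_entity, column_name):
    """Return True if a column with this name is already on the schema.

    Assumes each related column has a 'displayText' with its name.
    """
    rel_attrs = tabular_schema_entity.get("relationshipAttributes") or {}
    columns = rel_attrs.get("columns") or []
    return any(col.get("displayText") == column_name for col in columns)
```

With these in place, you could skip the upload when column_exists(tabular_schema_entity, "my_custom_column") returns True.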
Working with ResourceSets requires you to understand the tabular schema object attached to them. To update, add, or extract the columns from a resource set, you have to call PurviewClient.get_entity on your resource set to find the relationship to the tabular_schema object.
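Since not every asset has a tabular schema, the lookup of the tabular_schema relationship can be wrapped in a small helper that tolerates a missing schema. This is only a sketch: find_tabular_schema_guid is a hypothetical name, and it works on the plain dictionary returned for an entity rather than calling Purview itself.

```python
def find_tabular_schema_guid(entity):
    """Return the guid of the entity's tabular schema, or None.

    `entity` is the dictionary for a single entity, e.g. the first
    item in the "entities" list of a get_entity response.
    """
    rel_attrs = entity.get("relationshipAttributes") or {}
    tabular_schema = rel_attrs.get("tabular_schema") or {}
    return tabular_schema.get("guid")
```

If the helper returns None, the asset has no tabular schema attached and there are no columns to extract, update, or extend.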