
Bulk Upload or Update Entities with Excel for Purview and Apache Atlas

If you're starting your adventure with Azure Purview or Apache Atlas, you'll notice that there aren't built-in connectors, hooks, or bridges for every data source. Instead, you'll have to provide development resources with the time to build that connection. Unfortunately, not everyone has that kind of resource ready and available.

Instead, you can leverage PyApacheAtlas and its Excel features: fill in a spreadsheet in a prescribed format and push its contents, correctly formatted, to the Purview or Atlas REST API. This lets you get your metadata into your service faster, reduces the amount of code that needs to be written, and puts this power in the hands of business users, not just IT.

Starting with a simple spreadsheet

typeName | qualifiedName | name
DataSet | custom://unique-data-source | My Data Source

Save the above content to an Excel spreadsheet on a tab called BulkEntities. This is the absolute minimum necessary for an upload when the entity type has no additional required attributes.

Next, we can upload the contents using this snippet. Be sure you've installed and configured PyApacheAtlas and import the relevant client (PurviewClient for Purview, AtlasClient for Apache Atlas).

Here's an example using Azure Purview and a service principal.

import json

from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

# Authenticate with a service principal that has access to your
# Purview account.
auth = ServicePrincipalAuthentication(
    tenant_id="replace_with_tenant_id",
    client_id="replace_with_client_id",
    client_secret="replace_with_client_secret"
)
client = PurviewClient(
    account_name="PurviewAccountName",
    authentication=auth
)

# The configuration controls the expected sheet names and column
# prefixes; the defaults include a tab named BulkEntities.
ec = ExcelConfiguration()
reader = ExcelReader(ec)

# Convert each row of the BulkEntities tab into an entity.
entities = reader.parse_bulk_entities("path/to/spreadsheet.xlsx")

# Upload the entities and print the API response.
results = client.upload_entities(entities)

print(json.dumps(results, indent=2))

Assuming you've authenticated properly, you will have uploaded one entity to Purview!
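The printed results are the service's mutation response. Roughly (the guids and details will differ by service and version), it looks something like this:

{
  "mutatedEntities": {
    "CREATE": [
      {"typeName": "DataSet", "guid": "abc-123-def"}
    ]
  },
  "guidAssignments": {"-1001": "abc-123-def"}
}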

What happened in this script?

1. ServicePrincipalAuthentication acquires the bearer token used on every request to the service.
2. ExcelConfiguration tells the ExcelReader which sheet names and column prefixes to expect; by default that includes a tab named BulkEntities.
3. parse_bulk_entities reads each row of the BulkEntities tab and converts it into an entity object.
4. upload_entities pushes those entities to the Purview / Atlas bulk entity endpoint and returns the service's response.

Adding Attributes

The first table only showed the required fields. If you add any other column header and fill in a cell within that column, PyApacheAtlas will automatically add that attribute to the entity on that cell's row. Here's an example where we add a description.

typeName | qualifiedName | name | description | type
DataSet | custom://unique-data-source | My Data Source | This is my cool description |
column | custom://unique-column | My Custom Column | This is my cool column | integer

We've added a second row that will create a column type entity. In addition, we've added a description column and a type column. Description is applicable to every entity in Apache Atlas and Purview. Type is a required attribute on the column type, but it's not applicable to a DataSet. Leaving the cell blank prevents the attribute from being included for that row.

Try updating the spreadsheet we worked on earlier, re-run the script, and see the results!

A common follow-up question, especially for built-in types, is how do you know which attributes are required? There are two approaches:

1. Query the type definitions through the client (e.g. client.get_all_typedefs()) and look for attribute definitions where isOptional is false, as in the sketch below.
2. Review the built-in type definitions in the Apache Atlas source or the Azure Purview documentation for the type you're uploading.
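Here's a minimal sketch of the first approach, assuming the response follows the standard Atlas typedefs payload (a dictionary with entityDefs, each carrying attributeDefs and an isOptional flag):

# Sketch: list the required (non-optional) attributes of hive_column.
# Assumes the standard Atlas typedefs response shape.
all_defs = client.get_all_typedefs()

for entity_def in all_defs.get("entityDefs", []):
    if entity_def["name"] == "hive_column":
        required = [
            attr["name"]
            for attr in entity_def.get("attributeDefs", [])
            if not attr.get("isOptional", True)
        ]
        print(required)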

Adding Connections to Other Entities (Relationships)

Now let's say you want to take advantage of Purview's / Atlas' ability to connect entities to each other through the "relationship" feature (learn more about custom types and relationships in Purview and Atlas). PyApacheAtlas supports connecting entities that are being uploaded in the same batch.

typeName | qualifiedName | name | data_type | [Relationship] db
hive_table | custom://my/custom/hivetable | My Data Source | |
hive_column | custom://my/custom/hivetable#col01 | col01 | integer | custom://my/custom/hivetable

In this table, we added a [Relationship] db column that takes the qualified name of an asset mentioned elsewhere in the spreadsheet. The [Relationship] prefix is a special keyword that helps PyApacheAtlas know where to send this attribute, since Atlas has both "Attributes" and "Relationship Attributes" which may share the same name.

Where did db come from? You have to be familiar with the type you are working with. By calling a method like client.get_all_typedefs() I can see how hive_column is defined and see that it has a relationship attribute called db that points to a hive_table (i.e. a column points to one table and a table contains many columns).

When this BulkEntities tab is parsed, the hive_column entity will now have a relationship attribute called db filled in with a pointer (specifically an AtlasObjectId object) to the hive_table being created in this upload.

An important note: This is not an append operation. Providing a value here will overwrite the relationship attribute for the given entity.
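If you want to sanity-check what the reader produced before uploading, you can dump the parsed entities to JSON. A small sketch, assuming the entity objects returned by parse_bulk_entities expose a to_json method:

# Inspect the parsed entities (including relationship attributes like db)
# before uploading; to_json is assumed from PyApacheAtlas' AtlasEntity class.
entities = reader.parse_bulk_entities("path/to/spreadsheet.xlsx")
print(json.dumps([e.to_json() for e in entities], indent=2))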

Referencing Entities Not Being Uploaded with AtlasObjectId

Some entity types have required relationships (for example, an Azure SQL table requires an Azure SQL schema). You don't want to have to define the schema entity in your spreadsheet, because that would also require defining a database, which in turn requires defining a server!

To avoid this dependency chain, you can specify a relationship column header but provide a special value: the AtlasObjectId. The cell's value should be one of:

AtlasObjectId(guid:your-guid)
AtlasObjectId(typeName:your-type qualifiedName:your-qualified-name)

Here's an example sheet using the guid of the related azure_sql_schema entity.

typeName | qualifiedName | name | [Relationship] dbSchema
azure_sql_table | mssql://server/db/schema/MyTableName | MyTableName | AtlasObjectId(guid:123-abc-456)

Alternatively, I can specify the typeName and qualifiedName:

typeName | qualifiedName | name | [Relationship] dbSchema
azure_sql_table | mssql://server/db/schema/MyTableName | MyTableName | AtlasObjectId(typeName:azure_sql_schema qualifiedName:mssql://server/db/schema)

Following these examples, you can apply this pattern to any relationship attribute (other than meanings) and have a simpler upload when working with existing objects and their related entities.

Relating Multiple Entities in One Relationship Attribute

The [Relationship] columns support a configurable delimiter for providing multiple values (a semi-colon (;) by default).
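If a semi-colon clashes with your data, the delimiter can be changed when you create the configuration. A sketch, assuming ExcelConfiguration accepts a value_separator parameter (verify against your PyApacheAtlas version):

# Sketch: switch the multi-value delimiter from ";" to "|".
# value_separator is an assumption; check your version's documentation.
ec = ExcelConfiguration(value_separator="|")
reader = ExcelReader(ec)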

For example, using the AtlasObjectId, suppose you want to upload a table and associate multiple columns with it. This example creates two columns at upload time and also associates an existing column by guid.

typeName | qualifiedName | name | [Relationship] columns
hive_column | hive://tableA#col01 | col01 |
hive_column | hive://tableA#col02 | col02 |
hive_table | hive://tableA | tableA | hive://tableA#col01;hive://tableA#col02;AtlasObjectId(guid:xx-yy-zz)

Adding Glossary Terms

typeName | qualifiedName | name | [Relationship] meanings
DataSet | custom://my/custom/datasetwithmeanings | entity with meanings | my term;my other term

The "meanings" relationship attribute can be provided a set of glossary terms (default delimiter is semi-colon (;)). For Apache Atlas users, an important note is that this glossary term is hard coded to be the default Glossary (where the terms qualified names will resolve to my term@Glossary).

An important note: you can only apply meanings to new entities. To update an entity with new terms, you'll need to call the assignTerm endpoint.
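A hedged sketch of what that update might look like; the assignTerm method on the glossary client and its parameters are assumptions to verify against your PyApacheAtlas version:

# Hedged sketch: assign an existing term to an existing entity by guid.
# client.glossary.assignTerm and its parameter names are assumptions.
results = client.glossary.assignTerm(
    entities=[{"guid": "replace-with-entity-guid"}],
    termName="my term"
)
print(results)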

Adding Classifications

typeName | qualifiedName | name | [root] classifications
DataSet | custom://my/custom/datasetwithclassifs | entity with classifs | MICROSOFT.IPADDRESS;MICROSOFT.SOMETHING.ELSE

An important note: you can only apply classifications in the spreadsheet to new entities. To update entities and their classifications, you need to call another endpoint using classify_bulk_entities or classify_entity.
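A hedged sketch of classifying existing entities in bulk; AtlasClassification and the classify_bulk_entities signature shown here should be verified against your installed version:

# Hedged sketch: apply one classification to several existing entities.
from pyapacheatlas.core import AtlasClassification

results = client.classify_bulk_entities(
    entityGuids=["guid-1", "guid-2"],
    classification=AtlasClassification("MICROSOFT.IPADDRESS")
)
print(results)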

Adding experts and owners

Two more special column headers, experts and owners, will take AAD Object Ids (by default) and associate them with your entity. However, most users don't know their AAD Object Ids or how to look them up. Thankfully, PyApacheAtlas provides a utility to look up AAD Object Ids based on User Principal Name or primary email address.

For example, assuming you had a spreadsheet that looked like this:

typeName | qualifiedName | name | experts
DataSet | custom://my/custom/ds | expert ds | bill@example.com;will@example.com

Our starter script from above would need to make the below change to the parse_bulk_entities call.

entities = reader.parse_bulk_entities(
    "path/to/spreadsheet.xlsx",
    contacts_func=client.msgraph.email_to_id
)

The function passed into contacts_func is executed on every expert or owner provided. In this case, the PurviewClient.msgraph.email_to_id method takes an email address, queries the Microsoft Graph API, and returns the corresponding object id. To avoid unnecessary calls to the function you provide, the reader caches results in a built-in dictionary for the duration of the parse.
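Any callable that takes a contact string and returns an id satisfies this contract, so you can substitute your own lookup. Here's a sketch using a hypothetical local mapping instead of the Graph API:

# Hypothetical stand-in for client.msgraph.email_to_id: any function
# mapping a contact string to an object id works as contacts_func.
known_ids = {
    "bill@example.com": "11111111-aaaa-bbbb-cccc-000000000001",
    "will@example.com": "11111111-aaaa-bbbb-cccc-000000000002",
}

def my_contacts_func(contact):
    # Fall back to the raw value if no mapping exists.
    return known_ids.get(contact, contact)

entities = reader.parse_bulk_entities(
    "path/to/spreadsheet.xlsx",
    contacts_func=my_contacts_func
)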

Adding Custom Attributes

typeName | qualifiedName | name | [custom] attrib1 | [custom] attrib2
DataSet | custom://my/custom/datasetwithcustom | entity with custom stuff | string1 | string2

For Apache Atlas users, the Excel spreadsheet supports adding one or many custom attributes to the entity.

An important note: you can only add custom attributes to new entities. To update entities, you need to call another endpoint.

Adding Labels for Apache Atlas

typeName | qualifiedName | name | [root] labels
DataSet | custom://my/custom/datasetwithlabels | entity with labels | labelA;labelB

For Apache Atlas users, the Excel spreadsheet supports adding one or many labels to the entity.

An important note: you can only apply labels to new entities. To update entities, you need to call another endpoint.

Recap

You now have seen how to take advantage of the BulkEntities tab to:

- Upload new entities in bulk with just a typeName, qualifiedName, and name.
- Add optional and required attributes with additional column headers.
- Connect entities with [Relationship] columns, including references to existing entities via AtlasObjectId.
- Attach glossary terms, classifications, experts and owners, custom attributes, and labels at upload time.