
Bulk Upload or Update Entities with Excel for Purview and Apache Atlas

If you're starting your adventure with Azure Purview or Apache Atlas, you'll notice that there aren't built-in connectors, hooks, or bridges for every data source. Instead, you'll have to provide development resources with the time to build that connection. Unfortunately, not everyone has that kind of resource ready and available.

Instead, you can leverage PyApacheAtlas and its Excel features: fill in a spreadsheet in a prescribed format and push its contents, correctly formatted, to the Purview or Atlas REST API. This lets you get your metadata into your service faster, reduces the amount of code that needs to be written, and puts this power in the hands of business users, not just IT.

Starting with a simple spreadsheet

typeName | qualifiedName | name
DataSet | custom://unique-data-source | My Data Source

Save the above content to an Excel spreadsheet on a tab called BulkEntities. This is the absolute minimum necessary for an upload when the entity type has no additional required attributes.

Next, we can upload the contents using this snippet. Be sure you've installed and configured PyApacheAtlas and import the relevant client (PurviewClient for Purview, AtlasClient for Apache Atlas).

Here's an example using Azure Purview and a service principal.

import json

from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

# Authenticate with a service principal that has access to your
# Purview account.
auth = ServicePrincipalAuthentication(
    tenant_id="replace_with_tenant_id",
    client_id="replace_with_client_id",
    client_secret="replace_with_client_secret"
)
client = PurviewClient(
    account_name="PurviewAccountName",
    authentication=auth
)

# The configuration controls the expected sheet names and column
# prefixes; the defaults include a tab named BulkEntities.
ec = ExcelConfiguration()
reader = ExcelReader(ec)

# Convert each row of the BulkEntities tab into an entity.
entities = reader.parse_bulk_entities("path/to/spreadsheet.xlsx")

# Upload the entities and print the API response.
results = client.upload_entities(entities)

print(json.dumps(results, indent=2))

Assuming you've authenticated properly, you will have uploaded one entity to Purview!
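The printed results are the service's mutation response. Roughly (the guids and details will differ by service and version), it looks something like this:

{
  "mutatedEntities": {
    "CREATE": [
      {"typeName": "DataSet", "guid": "abc-123-def"}
    ]
  },
  "guidAssignments": {"-1001": "abc-123-def"}
}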

What happened in this script?

1. ServicePrincipalAuthentication acquires the bearer token used on every request to the service.
2. ExcelConfiguration tells the ExcelReader which sheet names and column prefixes to expect; by default that includes a tab named BulkEntities.
3. parse_bulk_entities reads each row of the BulkEntities tab and converts it into an entity object.
4. upload_entities pushes those entities to the Purview / Atlas bulk entity endpoint and returns the service's response.

Adding Attributes

The first table only showed the required fields. If you add any other column header and fill in a cell within that column, PyApacheAtlas will automatically add that attribute to the entity on that cell's row. Here's an example where we add a description.

typeName | qualifiedName | name | description | type
DataSet | custom://unique-data-source | My Data Source | This is my cool description |
column | custom://unique-column | My Custom Column | This is my cool column | integer

We've added a second row that will create a column type entity. In addition, we've added a description column and a type column. Description is applicable to every entity in Apache Atlas and Purview. Type is a required attribute on the column type, but it's not applicable to a DataSet. Leaving the cell blank prevents the attribute from being included for that row.

Try updating the spreadsheet we worked on earlier, re-run the script, and see the results!

A common follow-up question, especially for built-in types, is how do you know which attributes are required? There are two approaches:

1. Query the type definitions through the client (e.g. client.get_all_typedefs()) and look for attribute definitions where isOptional is false, as in the sketch below.
2. Review the built-in type definitions in the Apache Atlas source or the Azure Purview documentation for the type you're uploading.
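Here's a minimal sketch of the first approach, assuming the response follows the standard Atlas typedefs payload (a dictionary with entityDefs, each carrying attributeDefs and an isOptional flag):

# Sketch: list the required (non-optional) attributes of hive_column.
# Assumes the standard Atlas typedefs response shape.
all_defs = client.get_all_typedefs()

for entity_def in all_defs.get("entityDefs", []):
    if entity_def["name"] == "hive_column":
        required = [
            attr["name"]
            for attr in entity_def.get("attributeDefs", [])
            if not attr.get("isOptional", True)
        ]
        print(required)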

Adding Connections to Other Entities (Relationships)

Now let's say you want to take advantage of Purview's / Atlas' ability to connect entities to each other through the "relationship" feature (learn more about custom types and relationships in Purview and Atlas). PyApacheAtlas supports connecting entities that are being uploaded in the same batch.

typeName | qualifiedName | name | data_type | [Relationship] db
hive_table | custom://my/custom/hivetable | My Data Source | |
hive_column | custom://my/custom/hivetable#col01 | col01 | integer | custom://my/custom/hivetable

In this table, we added a [Relationship] db column that takes the qualified name of an asset mentioned elsewhere in the spreadsheet. The [Relationship] prefix is a special keyword that helps PyApacheAtlas know where to send this attribute, since Atlas has both "Attributes" and "Relationship Attributes" which may share the same name.

Where did db come from? You have to be familiar with the type you are working with. By calling a method like client.get_all_typedefs() I can see how hive_column is defined and see that it has a relationship attribute called db that points to a hive_table (i.e. a column points to one table and a table contains many columns).

When this BulkEntities tab is parsed, the hive_column entity will now have a relationship attribute called db filled in with a pointer (specifically an AtlasObjectId object) to the hive_table being created in this upload.

An important note: This is not an append operation. Providing a value here will overwrite the relationship attribute for the given entity.
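If you want to sanity-check what the reader produced before uploading, you can dump the parsed entities to JSON. A small sketch, assuming the entity objects returned by parse_bulk_entities expose a to_json method:

# Inspect the parsed entities (including relationship attributes like db)
# before uploading; to_json is assumed from PyApacheAtlas' AtlasEntity class.
entities = reader.parse_bulk_entities("path/to/spreadsheet.xlsx")
print(json.dumps([e.to_json() for e in entities], indent=2))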

Referencing Entities Not Being Uploaded with AtlasObjectId

Some entity types have required relationships (for example, an Azure SQL table requires an Azure SQL schema). You don't want to have to define the schema entity in your spreadsheet, because that would also require defining a database, which in turn requires defining a server!

To avoid this dependency chain, you can specify a relationship column header but provide a special value: the AtlasObjectId. The cell's value should be one of:

AtlasObjectId(guid:your-guid)
AtlasObjectId(typeName:your-type qualifiedName:your-qualified-name)

Here's an example sheet using the guid of the related azure_sql_schema entity.

typeName | qualifiedName | name | [Relationship] dbSchema
azure_sql_table | mssql://server/db/schema/MyTableName | MyTableName | AtlasObjectId(guid:123-abc-456)

Alternatively, I can specify the typeName and qualifiedName:

typeName | qualifiedName | name | [Relationship] dbSchema
azure_sql_table | mssql://server/db/schema/MyTableName | MyTableName | AtlasObjectId(typeName:azure_sql_schema qualifiedName:mssql://server/db/schema)

Following these examples, you can apply this pattern to any relationship attribute (other than meanings) and have a simpler upload when working with existing objects and their related entities.

Relating Multiple Entities in One Relationship Attribute

The [Relationship] columns support a configurable delimiter for providing multiple values (a semi-colon (;) by default).
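If a semi-colon clashes with your data, the delimiter can be changed when you create the configuration. A sketch, assuming ExcelConfiguration accepts a value_separator parameter (verify against your PyApacheAtlas version):

# Sketch: switch the multi-value delimiter from ";" to "|".
# value_separator is an assumption; check your version's documentation.
ec = ExcelConfiguration(value_separator="|")
reader = ExcelReader(ec)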

For example, using the AtlasObjectId, suppose you want to upload a table and associate multiple columns with it. This example creates two columns at upload time and also associates an existing column by guid.

typeName | qualifiedName | name | [Relationship] columns
hive_column | hive://tableA#col01 | col01 |
hive_column | hive://tableA#col02 | col02 |
hive_table | hive://tableA | tableA | hive://tableA#col01;hive://tableA#col02;AtlasObjectId(guid:xx-yy-zz)

Adding Glossary Terms

typeName | qualifiedName | name | [Relationship] meanings
DataSet | custom://my/custom/datasetwithmeanings | entity with meanings | my term;my other term

The "meanings" relationship attribute can be provided a set of glossary terms (default delimiter is semi-colon (;)). For Apache Atlas users, an important note is that this glossary term is hard coded to be the default Glossary (where the terms qualified names will resolve to my term@Glossary).

An important note: you can only apply meanings to new entities. To update an entity with new terms, you'll need to call the assignTerm endpoint.
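A hedged sketch of what that update might look like; the assignTerm method on the glossary client and its parameters are assumptions to verify against your PyApacheAtlas version:

# Hedged sketch: assign an existing term to an existing entity by guid.
# client.glossary.assignTerm and its parameter names are assumptions.
results = client.glossary.assignTerm(
    entities=[{"guid": "replace-with-entity-guid"}],
    termName="my term"
)
print(results)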

Adding Classifications

typeName | qualifiedName | name | [root] classifications
DataSet | custom://my/custom/datasetwithclassifs | entity with classifs | MICROSOFT.IPADDRESS;MICROSOFT.SOMETHING.ELSE

An important note: you can only apply classifications in the spreadsheet to new entities. To update entities and their classifications, you need to call another endpoint using classify_bulk_entities or classify_entity.
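A hedged sketch of classifying existing entities in bulk; AtlasClassification and the classify_bulk_entities signature shown here should be verified against your installed version:

# Hedged sketch: apply one classification to several existing entities.
from pyapacheatlas.core import AtlasClassification

results = client.classify_bulk_entities(
    entityGuids=["guid-1", "guid-2"],
    classification=AtlasClassification("MICROSOFT.IPADDRESS")
)
print(results)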

Adding experts and owners

Two more special column headers, experts and owners, will take AAD Object Ids (by default) and associate them with your entity. However, most users don't know their AAD Object Ids or how to look them up. Thankfully, PyApacheAtlas provides a utility to look up AAD Object Ids based on User Principal Name or primary email address.

For example, assuming you had a spreadsheet that looked like this:

typeName | qualifiedName | name | experts
DataSet | custom://my/custom/ds | expert ds | bill@example.com;will@example.com

Our starter script from above would need to make the below change to the parse_bulk_entities call.

entities = reader.parse_bulk_entities(
    "path/to/spreadsheet.xlsx",
    contacts_func=client.msgraph.email_to_id
)

The function passed into contacts_func is executed on every expert or owner provided. In this case, the PurviewClient.msgraph.email_to_id method takes an email address, queries the Microsoft Graph API, and returns the corresponding object id. To avoid unnecessary calls to the function you provide, the reader caches results in a built-in dictionary for the duration of the parse.
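Any callable that takes a contact string and returns an id satisfies this contract, so you can substitute your own lookup. Here's a sketch using a hypothetical local mapping instead of the Graph API:

# Hypothetical stand-in for client.msgraph.email_to_id: any function
# mapping a contact string to an object id works as contacts_func.
known_ids = {
    "bill@example.com": "11111111-aaaa-bbbb-cccc-000000000001",
    "will@example.com": "11111111-aaaa-bbbb-cccc-000000000002",
}

def my_contacts_func(contact):
    # Fall back to the raw value if no mapping exists.
    return known_ids.get(contact, contact)

entities = reader.parse_bulk_entities(
    "path/to/spreadsheet.xlsx",
    contacts_func=my_contacts_func
)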

Adding Custom Attributes

typeName | qualifiedName | name | [custom] attrib1 | [custom] attrib2
DataSet | custom://my/custom/datasetwithcustom | entity with custom stuff | string1 | string2

For Apache Atlas users, the Excel spreadsheet supports adding one or many custom attributes to the entity.

An important note: you can only add custom attributes to new entities. To update entities, you need to call another endpoint.

Adding Labels for Apache Atlas

typeName | qualifiedName | name | [root] labels
DataSet | custom://my/custom/datasetwithlabels | entity with labels | labelA;labelB

For Apache Atlas users, the Excel spreadsheet supports adding one or many labels to the entity.

An important note: you can only apply labels to new entities. To update entities, you need to call another endpoint.

Recap

You now have seen how to take advantage of the BulkEntities tab to:

- Upload new entities in bulk with just a typeName, qualifiedName, and name.
- Add optional and required attributes with additional column headers.
- Connect entities with [Relationship] columns, including references to existing entities via AtlasObjectId.
- Attach glossary terms, classifications, experts and owners, custom attributes, and labels at upload time.