Home OSS About Privacy

Using Excel for Column Lineage / Mapping in Purview with PyApacheAtlas

If you're using Azure Purview, you can take advantage of the Lineage UI feature called "column mapping". Using PyApacheAtlas, you can load your columns If you're looking to do this by hand, there's a great tutorial on creating custom lineage with column mapping in code.

Starting with a simple spreadsheet

Unfortunately, we need to create a custom Process type before we can use the ColumnMapping tab. See the custom types with excel sample about custom process types before you attempt to upload

Target column Target qualifiedName Source column Source qualifiedName Process name Process qualifiedName Process typeName
destcolA custom://target-that-exists colA custom://source-that-exists My Custom Process custom://process-to-be-made CustomProcess
destcolB custom://target-that-exists colB custom://source-that-exists My Custom Process custom://process-to-be-made CustomProcess
destcolCombo custom://target-that-exists colC custom://source-that-exists My Custom Process custom://process-to-be-made CustomProcess
destcolCombo custom://target-that-exists colD custom://source-that-exists My Custom Process custom://process-to-be-made CustomProcess

This spreadsheet defines a custom mapping that includes:

If you're not a fan of "Source" and "Target" as the prefix, those can be changed in the Excel Configuration for PyApacheAtlas.

Save the above content to an excel spreadsheet on a tab called ColumnMapping. This is the absolute minimum necessary to do an upload.

Next, we can upload the contents using this snippet. Be sure to have installed and configured PyApacheAtlas.

Here's an example using Azure Purview and Service Principal.

import json
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

auth = ServicePrincipalAuthentication(
    tenant_id = "replace_with_tenant_id",
    client_id = "replace_with_client_id",
    client_secret = "replace_with_client_secret"
)
client = PurviewClient(
    account_name= "PurviewAccountName",
    authentication = auth
)

ec = ExcelConfiguration()
reader = ExcelReader(ec)

entities = reader.parse_column_mapping("path/to/spreadsheet.xlsx")

results = client.upload_entities(entities)

print(json.dumps(results, indent=2))

Assuming you've authenticated properly, you will have uploaded one entity to Purview!

What happened in this script?

Combining UpdateLineage with ColumnMapping tabs

If you're defining brand new lineage, you might used the Purview custom lineage excel feature and the UpdateLineage tab to define inputs and outputs at the table level. The ColumnMapping tab defines the Purview data structure for activating the Column Mapping feature in the Lineage tab.

You can combine the results of both tabs easily by calling the reader.parse_update_lineage_with_mappings(filepath) method instead. This will take the results from the UpdateLineage tab and apply the columnMappings attribute to those entities.

Recap