If you're using Azure Purview, you can take advantage of the Lineage UI feature called "column mapping". Using PyApacheAtlas, you can load your column mappings from an Excel spreadsheet. If you're looking to do this by hand, there's a great tutorial on creating custom lineage with column mapping in code.
Unfortunately, we need to create a custom Process type before we can use the ColumnMapping tab. See the custom types with Excel sample for details on custom process types before you attempt to upload.
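As a minimal sketch, a custom Process type matching the CustomProcess typeName used in the table below could be defined with PyApacheAtlas as shown here. This assumes the reason a custom type is needed is that the built-in Process type lacks a columnMapping attribute, so the sketch adds one as a plain string attribute; check the custom types sample for the authoritative version.

```python
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient
from pyapacheatlas.core.typedef import EntityTypeDef, AtlasAttributeDef

auth = ServicePrincipalAuthentication(
    tenant_id="replace_with_tenant_id",
    client_id="replace_with_client_id",
    client_secret="replace_with_client_secret"
)
client = PurviewClient(account_name="PurviewAccountName", authentication=auth)

# Sketch: a custom Process type named "CustomProcess" (the Process typeName
# used in the spreadsheet below) that inherits from Process and carries a
# columnMapping string attribute for Purview's Lineage UI. Assumed layout.
custom_process = EntityTypeDef(
    name="CustomProcess",
    superTypes=["Process"],
    attributeDefs=[AtlasAttributeDef(name="columnMapping")]
)

client.upload_typedefs(entityDefs=[custom_process], force_update=True)
```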
Target column | Target qualifiedName | Source column | Source qualifiedName | Process name | Process qualifiedName | Process typeName |
---|---|---|---|---|---|---|
destcolA | custom://target-that-exists | colA | custom://source-that-exists | My Custom Process | custom://process-to-be-made | CustomProcess |
destcolB | custom://target-that-exists | colB | custom://source-that-exists | My Custom Process | custom://process-to-be-made | CustomProcess |
destcolCombo | custom://target-that-exists | colC | custom://source-that-exists | My Custom Process | custom://process-to-be-made | CustomProcess |
destcolCombo | custom://target-that-exists | colD | custom://source-that-exists | My Custom Process | custom://process-to-be-made | CustomProcess |
This spreadsheet defines a custom mapping that includes:

* A one-to-one mapping from colA on the source asset to destcolA on the target asset.
* A one-to-one mapping from colB to destcolB.
* A many-to-one mapping that combines colC and colD into destcolCombo.

All four rows share the same process (My Custom Process with qualified name custom://process-to-be-made), so they roll up into a single CustomProcess entity.
If you're not a fan of "Source" and "Target" as the column header prefixes, they can be changed in the Excel Configuration for PyApacheAtlas, as sketched below.
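The sketch below assumes the prefixes are controlled by keyword arguments on ExcelConfiguration; the exact parameter names (entity_source_prefix, entity_target_prefix) are an assumption based on the package's defaults, so verify them against your installed version's docstring.

```python
from pyapacheatlas.readers import ExcelConfiguration

# Assumption: these keyword arguments change the "Source" / "Target"
# header prefixes the reader looks for (e.g. "From column", "To column").
ec = ExcelConfiguration(
    entity_source_prefix="From",
    entity_target_prefix="To"
)
```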
Save the above content to an Excel spreadsheet on a tab called ColumnMapping. This is the absolute minimum necessary to do an upload.
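If you'd rather not build the workbook by hand, PyApacheAtlas can generate a starter template for you; the sketch below assumes the generated file includes a ColumnMapping tab that you can then fill in with the rows above.

```python
from pyapacheatlas.readers import ExcelReader

# Creates an Excel template with the standard PyApacheAtlas tabs
# at the given path; fill in the ColumnMapping tab with your rows.
ExcelReader.make_template("path/to/spreadsheet.xlsx")
```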
Next, we can upload the contents using this snippet. Be sure to have installed and configured PyApacheAtlas.
Here's an example using Azure Purview and a Service Principal.
```python
import json

from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

# Authenticate with a Service Principal that has access to your Purview account.
auth = ServicePrincipalAuthentication(
    tenant_id="replace_with_tenant_id",
    client_id="replace_with_client_id",
    client_secret="replace_with_client_secret"
)

client = PurviewClient(
    account_name="PurviewAccountName",
    authentication=auth
)

# Parse the ColumnMapping tab into Process entities and upload them.
ec = ExcelConfiguration()
reader = ExcelReader(ec)
entities = reader.parse_column_mapping("path/to/spreadsheet.xlsx")
results = client.upload_entities(entities)

# Pretty print the upload results.
print(json.dumps(results, indent=2))
```
Assuming you've authenticated properly, you will have uploaded one entity to Purview!
What happened in this script?
* For authentication, you could also install azure-identity and use the AzureCliCredential for even less code having to be written.
* parse_column_mapping reads the ColumnMapping tab and converts it into Process entities, and upload_entities does the actual upload to your Atlas or Purview service.
* json.dumps takes a Python dictionary and turns it into a string. The indent=2 tells Python to add two spaces for each level in the resulting json.

If you're defining brand new lineage, you might use the Purview custom lineage Excel feature and the UpdateLineage tab to define inputs and outputs at the table level. The ColumnMapping tab defines the Purview data structure for activating the Column Mapping feature in the Lineage tab.
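For reference, the columnMapping attribute that ends up on the Process entity is a JSON-encoded string. The sketch below shows roughly the shape produced for the table above, assuming Purview's DatasetMapping / ColumnMapping convention with Source and Sink keys; treat the exact key names as an assumption to verify against the Purview docs.

```python
import json

# Roughly the structure Purview's Lineage UI reads from the columnMapping
# attribute (stored as a JSON string on the Process entity).
column_mapping = [
    {
        "DatasetMapping": {
            "Source": "custom://source-that-exists",
            "Sink": "custom://target-that-exists"
        },
        "ColumnMapping": [
            {"Source": "colA", "Sink": "destcolA"},
            {"Source": "colB", "Sink": "destcolB"},
            {"Source": "colC", "Sink": "destcolCombo"},
            {"Source": "colD", "Sink": "destcolCombo"}
        ]
    }
]

columnMapping_attribute = json.dumps(column_mapping)
```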
You can combine the results of both tabs easily by calling the reader.parse_update_lineage_with_mappings(filepath) method instead. This will take the results from the UpdateLineage tab and apply the columnMapping attribute to those entities.
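A short sketch of that variant, reusing the reader and client from the snippet above:

```python
# Parses both the UpdateLineage and ColumnMapping tabs and attaches the
# column mappings to the resulting Process entities before uploading.
entities = reader.parse_update_lineage_with_mappings("path/to/spreadsheet.xlsx")
results = client.upload_entities(entities)
print(json.dumps(results, indent=2))
```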
* Use the ExcelReader.parse_column_mapping(filepath) method to extract the Process entities that would be uploaded with PurviewClient.upload_entities.
* Use the ExcelReader.parse_update_lineage_with_mappings(filepath) method to extract Process entities from both the UpdateLineage and ColumnMapping tabs.