Custom Lineage with Excel for Purview and Apache Atlas

Lineage in Purview and Apache Atlas is all about connecting entities together through Process entities. The Process entity has an inputs attribute and an outputs attribute. Each one takes an array of entities.
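
To make that shape concrete, here is a minimal sketch of a Process entity written as a Python dictionary in Atlas-style JSON. The `entity_ref` helper and the `-1` placeholder guid are illustrative assumptions, and the qualified names are borrowed from the sample spreadsheet further down; treat it as a picture of the structure rather than the exact payload PyApacheAtlas produces.

```python
import json


def entity_ref(type_name, qualified_name):
    """Hypothetical helper: reference an existing entity by type and qualified name."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


process = {
    "typeName": "Process",
    "guid": "-1",  # assumed placeholder id for an entity that does not exist yet
    "attributes": {
        "name": "My Custom Process",
        "qualifiedName": "custom://process-to-be-made",
        # inputs and outputs are both arrays of entity references
        "inputs": [entity_ref("DataSet", "custom://source-that-exists")],
        "outputs": [entity_ref("DataSet", "custom://target-that-exists")],
    },
}

print(json.dumps(process, indent=2))
```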

Starting with a simple spreadsheet
Multiple Inputs for a Single Output
Multiple Inputs and Multiple Outputs
Ensuring an Input or Output is an Empty List
Recap

For the UpdateLineage tab in the PyApacheAtlas spreadsheet, we can take existing input and output entities and feed them into a new or existing Process entity.

Starting with a simple spreadsheet

Target typeName	Target qualifiedName	Source typeName	Source qualifiedName	Process name	Process qualifiedName	Process typeName
DataSet	custom://target-that-exists	DataSet	custom://source-that-exists	My Custom Process	custom://process-to-be-made	Process

This spreadsheet defines a custom lineage that includes:

A Process entity that will be updated or created.
One Target entity (defined by type and qualified name) as one of the outputs. The entity must exist already in your Purview / Atlas service.
One Source entity (defined by type and qualified name) as one of the inputs. The entity must exist already in your Purview / Atlas service.

If you're not a fan of "Source" and "Target" as the prefix, those can be changed in the Excel Configuration for PyApacheAtlas.

Save the above content to an Excel spreadsheet on a tab called UpdateLineage. This is the absolute minimum necessary to do an upload, as long as the Process entity has no additional required attributes and needs no further inputs or outputs.

Next, we can upload the contents using this snippet. Be sure to have installed and configured PyApacheAtlas and include the relevant client (PurviewClient for Purview and AtlasClient for Apache Atlas).

Here's an example using Azure Purview and a service principal.

import json
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

auth = ServicePrincipalAuthentication(
    tenant_id = "replace_with_tenant_id",
    client_id = "replace_with_client_id",
    client_secret = "replace_with_client_secret"
)
client = PurviewClient(
    account_name= "PurviewAccountName",
    authentication = auth
)

ec = ExcelConfiguration()
reader = ExcelReader(ec)

entities = reader.parse_update_lineage("path/to/spreadsheet.xlsx")

results = client.upload_entities(entities)

print(json.dumps(results, indent=2))

Assuming you've authenticated properly, you will have uploaded one entity to Purview!

What happened in this script?

PurviewClient and ServicePrincipalAuthentication: These are the necessary boilerplate to set up communication with Azure Purview. Alternatively, you could use azure-identity and its AzureCliCredential, which requires even less code.
ExcelConfiguration and ExcelReader: These define methods for extracting entities from the standard excel template. You can customize the ExcelConfiguration if you don't want to use the standard template.
parse_update_lineage: This method reads from, by default, the UpdateLineage tab and converts what you've provided into Atlas Entities ready to be uploaded to Azure Purview or Apache Atlas. Note: This method does NOT do an upload, it merely parses the spreadsheet tab.
upload_entities: This method takes the entities we extracted with parse_update_lineage and does the actual upload to your Atlas or Purview service.
print(json.dumps()): This print statement takes the output of our upload and prints it nicely with some indentation. json.dumps takes a Python dictionary and turns it into a string. The indent=2 tells Python to add two spaces for each level in the resulting JSON.
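
The AzureCliCredential alternative mentioned above could look roughly like the sketch below. It assumes the azure-identity package is installed, that you have already run `az login`, and that your PyApacheAtlas version accepts an azure-identity credential for the authentication argument. The function name `build_client_with_cli_credential` is made up for this example.

```python
def build_client_with_cli_credential(account_name):
    """Return a PurviewClient that reuses the Azure CLI login (assumes `az login` was run)."""
    # Imports live inside the function so the sketch can be defined and read
    # even where the two packages are not installed.
    from azure.identity import AzureCliCredential
    from pyapacheatlas.core.client import PurviewClient

    credential = AzureCliCredential()
    return PurviewClient(account_name=account_name, authentication=credential)
```

Calling build_client_with_cli_credential("PurviewAccountName") would then stand in for the ServicePrincipalAuthentication and PurviewClient lines in the script, with no client secret stored in code.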

Multiple Inputs for a Single Output

Target typeName	Target qualifiedName	Source typeName	Source qualifiedName	Process name	Process qualifiedName	Process typeName
DataSet	custom://target-that-exists	DataSet	custom://source-that-exists	My Custom Process	custom://process-to-be-made	Process
		DataSet	custom://2nd-source-that-exists	My Custom Process	custom://process-to-be-made	Process
		DataSet	custom://3rd-source-that-exists	My Custom Process	custom://process-to-be-made	Process

We need to specify the target only one time. If you specify the same target or source multiple times, you'll get a warning.
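
One way to picture how those three rows become a single Process is the sketch below. It is not PyApacheAtlas's parsing code; `collapse_rows` and the three-column row tuples are simplifications invented here to show rows being grouped by the process qualified name, blank target cells being skipped, and a repeated name raising a warning.

```python
import warnings

# (target qualifiedName, source qualifiedName, process qualifiedName)
ROWS = [
    ("custom://target-that-exists", "custom://source-that-exists", "custom://process-to-be-made"),
    ("", "custom://2nd-source-that-exists", "custom://process-to-be-made"),
    ("", "custom://3rd-source-that-exists", "custom://process-to-be-made"),
]


def collapse_rows(rows):
    """Group rows by process and collect unique inputs and outputs."""
    processes = {}
    for target, source, process in rows:
        entry = processes.setdefault(process, {"inputs": [], "outputs": []})
        for key, value in (("outputs", target), ("inputs", source)):
            if not value:
                continue  # blank cell: this row adds nothing on that side
            if value in entry[key]:
                warnings.warn(f"{value} is already listed in {key} of {process}")
                continue
            entry[key].append(value)
    return processes


lineage = collapse_rows(ROWS)
print(lineage)
```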

Multiple Inputs and Multiple Outputs

Target typeName	Target qualifiedName	Source typeName	Source qualifiedName	Process name	Process qualifiedName	Process typeName
DataSet	custom://target-that-exists	DataSet	custom://source-that-exists	My Custom Process	custom://process-to-be-made	Process
DataSet	custom://2nd-target-that-exists	DataSet	custom://2nd-source-that-exists	My Custom Process	custom://process-to-be-made	Process
		DataSet	custom://3rd-source-that-exists	My Custom Process	custom://process-to-be-made	Process

Since the inputs and outputs of a Process entity are arrays, their order doesn't matter. We could have put target-that-exists on row 2 and 2nd-target-that-exists on row 1. If your process creates multiple outputs and you need to record that Table A and Table B make Table X while Table A and Table C make Table Y, consider using multiple Process entities or Purview's column mapping feature.
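
As a rough illustration of the "multiple Process entities" option, the sketch below generates UpdateLineage rows for two separate processes, one per output table. Every qualified name and process name here (custom://table-a, Make Table X, and so on) is hypothetical, as is the `rows_for_process` helper. The output is tab-separated text that could be pasted into the UpdateLineage tab.

```python
import csv
import io

HEADER = [
    "Target typeName", "Target qualifiedName",
    "Source typeName", "Source qualifiedName",
    "Process name", "Process qualifiedName", "Process typeName",
]


def rows_for_process(name, qualified_name, target, sources, type_name="DataSet"):
    """Build one row per source; the target appears only on the first row."""
    rows = []
    for index, source in enumerate(sources):
        first = index == 0
        rows.append([
            type_name if first else "",
            target if first else "",
            type_name, source,
            name, qualified_name, "Process",
        ])
    return rows


# Table A and Table B make Table X; Table A and Table C make Table Y.
rows = (
    rows_for_process("Make Table X", "custom://make-table-x",
                     "custom://table-x", ["custom://table-a", "custom://table-b"])
    + rows_for_process("Make Table Y", "custom://make-table-y",
                       "custom://table-y", ["custom://table-a", "custom://table-c"])
)

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(HEADER)
writer.writerows(rows)
print(buffer.getvalue())
```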

Ensuring an Input or Output is an Empty List

Target typeName	Target qualifiedName	Source typeName	Source qualifiedName	Process name	Process qualifiedName	Process typeName
	N/A	DataSet	custom://source-that-exists	Input Only Process	custom://input-only-process	Process
DataSet	custom://source-that-exists02		N/A	My Custom Process	custom://output-only-process	Process

In this example, the first row has a process that will have one source/input and will ensure the target/output is an empty list. The target's qualified name is set to N/A, a special keyword that forces the outputs to be an empty list. The second row does the reverse: setting the source's qualified name to N/A forces the inputs to be an empty list.

Leaving the target qualified name completely blank would indicate "no change" to the target / outputs for an existing entity. If the entity is new, it will default to an empty list in the Atlas / Purview service. The same applies to the qualified name for the source columns as well.
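
The three cases (a real qualified name, N/A, and a blank cell) can be summarized in a small sketch. This is not how PyApacheAtlas implements it internally; `interpret_qualified_name` and the `NO_CHANGE` sentinel are hypothetical names used only to spell out the rules described above.

```python
NO_CHANGE = object()  # hypothetical sentinel: leave the existing inputs/outputs alone


def interpret_qualified_name(cell):
    """Map a Source/Target qualifiedName cell to its intended effect."""
    text = (cell or "").strip()
    if text == "N/A":
        return []  # force the inputs or outputs to an empty list
    if text == "":
        return NO_CHANGE  # blank: no change for an existing Process
    return [text]  # a real qualified name to reference


print(interpret_qualified_name("N/A"))
print(interpret_qualified_name("") is NO_CHANGE)
print(interpret_qualified_name("custom://source-that-exists"))
```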

Recap

Use an UpdateLineage tab in an Excel spreadsheet and the ExcelReader.parse_update_lineage method to define and extract entities for creating custom lineage.
The spreadsheet must have column headers:
- Target typeName
- Target qualifiedName
- Source typeName
- Source qualifiedName
- Process name
- Process qualifiedName
- Process typeName
If you set a source or target qualified name to "N/A", the inputs or outputs (depending on which qualified name column has the N/A) are overwritten with an empty list.
You can specify multiple sources and targets for a given process.
A given source or target should be specified only once per process.