Home OSS About Privacy

Custom Lineage with Excel for Purview and Apache Atlas

Lineage in Purview and Apache Atlas is all about connecting entities together through Process entities. The Process entity has an inputs attribute and an outputs attribute. Each one takes an array of entities.

For the UpdateLineage tab in the PyApacheAtlas spreadsheet, we can take existing inputs and output entities and feed them into a new or existing Process entity.

Starting with a simple spreadsheet

Target typeName Target qualifiedName Source typeName Source qualifiedName Process name Process qualifiedName Process typeName
DataSet custom://target-that-exists DataSet custom://source-that-exists My Custom Process custom://process-to-be-made Process

This spreadsheet defines a custom lineage that includes:

If you're not a fan of "Source" and "Target" as the prefix, those can be changed in the Excel Configuration for PyApacheAtlas.

Save the above content to an excel spreadsheet on a tab called UpdateLineage. This is the absolute minimum necessary to do an upload if the Process entity has no additional required attributes or inputs and outputs.

Next, we can upload the contents using this snippet. Be sure to have installed and configured PyApacheAtlas and include the relevant client (PurviewClient for Purview and AtlasClient for Apache Atlas).

Here's an example using Azure Purview and the Azure CLI.

import json
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

auth = ServicePrincipalAuthentication(
    tenant_id = "replace_with_tenant_id",
    client_id = "replace_with_client_id",
    client_secret = "replace_with_client_secret"
)
client = PurviewClient(
    account_name= "PurviewAccountName",
    authentication = auth
)

ec = ExcelConfiguration()
reader = ExcelReader(ec)

entities = reader.parse_update_lineage('path/to/spreadsheet.xlsx")

results = client.upload_entities(entities)

print(json.dumps(results, indent=2))

Assuming you've authenticated properly, you will have uploaded one entity to Purview!

What happened in this script?

Multiple Inputs for a Single Output

Target typeName Target qualifiedName Source typeName Source qualifiedName Process name Process qualifiedName Process typeName
DataSet custom://target-that-exists DataSet custom://source-that-exists My Custom Process custom://process-to-be-made Process
DataSet custom://2nd-source-that-exists My Custom Process custom://process-to-be-made Process
DataSet custom://3rd-source-that-exists My Custom Process custom://process-to-be-made Process

We need to specify the target only one time. If you specify the same target or source multiple times, you'll get a warning.

Multiple Inputs and Multiple Outputs

Target typeName Target qualifiedName Source typeName Source qualifiedName Process name Process qualifiedName Process typeName
DataSet custom://target-that-exists DataSet custom://source-that-exists My Custom Process custom://process-to-be-made Process
DataSet custom://2nd-target-that-exists DataSet custom://2nd-source-that-exists My Custom Process custom://process-to-be-made Process
DataSet custom://3rd-source-that-exists My Custom Process custom://process-to-be-made Process

Since the inputs and outputs of a Process entity are arrays, their order doesn't really matter. We could have put target-that-exists on row 2 and 2nd-target-that-exists on row 1. If you have a process that creates multiple outputs and you really need to specify that Table A and Table B make Table X but Table A and Table C make Table Y then you might want to consider using multiple process entities or leverage Purview's column mapping feature.

Ensuring an Input or Output is an Empty List

Target typeName Target qualifiedName Source typeName Source qualifiedName Process name Process qualifiedName Process typeName
N/A DataSet custom://source-that-exists Input Only Process custom://input-only-process Process
DataSet custom://source-that-exists02 N/A My Custom Process custom://output-only-process Process

In this example, the first row has a process that will have one source/input and will ensure the target/output is nothing / an empty list. The target's qualified name is set to N/A which is a special keyword that forces the input to be a blank list. The same applies to the qualified name for the source column as well.

Leaving the target qualified name completely blank would indicate "no change" to the target / outputs for an existing entity. If the entity is new, it will default to an empty list in the Atlas / Purview service. The same applies to the qualified name for the source columns as well.

Recap