Lineage in Purview and Apache Atlas is all about connecting entities together through Process
entities. The Process entity has an inputs attribute and an outputs attribute. Each one takes an array of entities.
For the UpdateLineage
tab in the PyApacheAtlas spreadsheet, we can take existing inputs and output entities and feed them into a new or existing Process entity.
Target typeName | Target qualifiedName | Source typeName | Source qualifiedName | Process name | Process qualifiedName | Process typeName |
---|---|---|---|---|---|---|
DataSet | custom://target-that-exists | DataSet | custom://source-that-exists | My Custom Process | custom://process-to-be-made | Process |
This spreadsheet defines a custom lineage that includes:
If you're not a fan of "Source" and "Target" as the prefix, those can be changed in the Excel Configuration for PyApacheAtlas.
Save the above content to an excel spreadsheet on a tab called UpdateLineage
. This is the absolute minimum necessary to do an upload if the Process entity has no additional required attributes or inputs and outputs.
Next, we can upload the contents using this snippet. Be sure to have installed and configured PyApacheAtlas and include the relevant client (PurviewClient for Purview and AtlasClient for Apache Atlas).
Here's an example using Azure Purview and the Azure CLI.
import json
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader
auth = ServicePrincipalAuthentication(
tenant_id = "replace_with_tenant_id",
client_id = "replace_with_client_id",
client_secret = "replace_with_client_secret"
)
client = PurviewClient(
account_name= "PurviewAccountName",
authentication = auth
)
ec = ExcelConfiguration()
reader = ExcelReader(ec)
entities = reader.parse_update_lineage('path/to/spreadsheet.xlsx")
results = client.upload_entities(entities)
print(json.dumps(results, indent=2))
Assuming you've authenticated properly, you will have uploaded one entity to Purview!
What happened in this script?
azure-identity
and the AzureCliCredential
for even less code having to be written.parse_update_lineage
and does the actual upload to your Atlas or Purview service.json.dumps
turns a Python dictionary
and turns it into a string. The indent=2
tells Python to add two spaces for each level in the resulting json.Target typeName | Target qualifiedName | Source typeName | Source qualifiedName | Process name | Process qualifiedName | Process typeName |
---|---|---|---|---|---|---|
DataSet | custom://target-that-exists | DataSet | custom://source-that-exists | My Custom Process | custom://process-to-be-made | Process |
DataSet | custom://2nd-source-that-exists | My Custom Process | custom://process-to-be-made | Process | ||
DataSet | custom://3rd-source-that-exists | My Custom Process | custom://process-to-be-made | Process |
We need to specify the target only one time. If you specify the same target or source multiple times, you'll get a warning.
Target typeName | Target qualifiedName | Source typeName | Source qualifiedName | Process name | Process qualifiedName | Process typeName |
---|---|---|---|---|---|---|
DataSet | custom://target-that-exists | DataSet | custom://source-that-exists | My Custom Process | custom://process-to-be-made | Process |
DataSet | custom://2nd-target-that-exists | DataSet | custom://2nd-source-that-exists | My Custom Process | custom://process-to-be-made | Process |
DataSet | custom://3rd-source-that-exists | My Custom Process | custom://process-to-be-made | Process |
Since the inputs and outputs of a Process entity are arrays, their order doesn't really matter. We could have put target-that-exists
on row 2 and 2nd-target-that-exists
on row 1. If you have a process that creates multiple outputs and you really need to specify that Table A and Table B make Table X but Table A and Table C make Table Y then you might want to consider using multiple process entities or leverage Purview's column mapping feature.
Target typeName | Target qualifiedName | Source typeName | Source qualifiedName | Process name | Process qualifiedName | Process typeName |
---|---|---|---|---|---|---|
N/A | DataSet | custom://source-that-exists | Input Only Process | custom://input-only-process | Process | |
DataSet | custom://source-that-exists02 | N/A | My Custom Process | custom://output-only-process | Process |
In this example, the first row has a process that will have one source/input and will ensure the target/output is nothing / an empty list. The target's qualified name is set to N/A which is a special keyword that forces the input to be a blank list. The same applies to the qualified name for the source column as well.
Leaving the target qualified name completely blank would indicate "no change" to the target / outputs for an existing entity. If the entity is new, it will default to an empty list in the Atlas / Purview service. The same applies to the qualified name for the source columns as well.
client.parse_update_lineage
method to define and extract entities for creating custom Lineage.