If you're using Azure Purview, you can take advantage of the Lineage UI feature called "column mapping". Using PyApacheAtlas, you can load your column mappings from an Excel spreadsheet. If you're looking to do this by hand, there's a great tutorial on creating custom lineage with column mapping in code.
Unfortunately, we need to create a custom Process type before we can use the ColumnMapping tab. See the custom types with Excel sample for details on custom process types before you attempt to upload.
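As a minimal sketch, a custom Process type matching the CustomProcess typeName used in the table below could be defined with PyApacheAtlas as shown here. This assumes the reason a custom type is needed is that the built-in Process type lacks a columnMapping attribute, so the sketch adds one as a plain string attribute; check the custom types sample for the authoritative version.

```python
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient
from pyapacheatlas.core.typedef import EntityTypeDef, AtlasAttributeDef

auth = ServicePrincipalAuthentication(
    tenant_id="replace_with_tenant_id",
    client_id="replace_with_client_id",
    client_secret="replace_with_client_secret"
)
client = PurviewClient(account_name="PurviewAccountName", authentication=auth)

# Sketch: a custom Process type named "CustomProcess" (the Process typeName
# used in the spreadsheet below) that inherits from Process and carries a
# columnMapping string attribute for Purview's Lineage UI. Assumed layout.
custom_process = EntityTypeDef(
    name="CustomProcess",
    superTypes=["Process"],
    attributeDefs=[AtlasAttributeDef(name="columnMapping")]
)

client.upload_typedefs(entityDefs=[custom_process], force_update=True)
```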
Target column | Target qualifiedName | Source column | Source qualifiedName | Process name | Process qualifiedName | Process typeName |
---|---|---|---|---|---|---|
destcolA | custom://target-that-exists | colA | custom://source-that-exists | My Custom Process | custom://process-to-be-made | CustomProcess |
destcolB | custom://target-that-exists | colB | custom://source-that-exists | My Custom Process | custom://process-to-be-made | CustomProcess |
destcolCombo | custom://target-that-exists | colC | custom://source-that-exists | My Custom Process | custom://process-to-be-made | CustomProcess |
destcolCombo | custom://target-that-exists | colD | custom://source-that-exists | My Custom Process | custom://process-to-be-made | CustomProcess |
This spreadsheet defines a custom mapping that includes:

* A one-to-one mapping from colA on the source asset to destcolA on the target asset.
* A one-to-one mapping from colB to destcolB.
* A many-to-one mapping that combines colC and colD into destcolCombo.

All four rows share the same process (My Custom Process with qualified name custom://process-to-be-made), so they roll up into a single CustomProcess entity.
If you're not a fan of "Source" and "Target" as the column header prefixes, they can be changed in the Excel Configuration for PyApacheAtlas, as sketched below.
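The sketch below assumes the prefixes are controlled by keyword arguments on ExcelConfiguration; the exact parameter names (entity_source_prefix, entity_target_prefix) are an assumption based on the package's defaults, so verify them against your installed version's docstring.

```python
from pyapacheatlas.readers import ExcelConfiguration

# Assumption: these keyword arguments change the "Source" / "Target"
# header prefixes the reader looks for (e.g. "From column", "To column").
ec = ExcelConfiguration(
    entity_source_prefix="From",
    entity_target_prefix="To"
)
```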
Save the above content to an Excel spreadsheet on a tab called ColumnMapping. This is the absolute minimum necessary to do an upload.
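If you'd rather not build the workbook by hand, PyApacheAtlas can generate a starter template for you; the sketch below assumes the generated file includes a ColumnMapping tab that you can then fill in with the rows above.

```python
from pyapacheatlas.readers import ExcelReader

# Creates an Excel template with the standard PyApacheAtlas tabs
# at the given path; fill in the ColumnMapping tab with your rows.
ExcelReader.make_template("path/to/spreadsheet.xlsx")
```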
Next, we can upload the contents using this snippet. Be sure to have installed and configured PyApacheAtlas.
Here's an example using Azure Purview and a Service Principal.
```python
import json

from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

# Authenticate with a Service Principal that has access to your Purview account.
auth = ServicePrincipalAuthentication(
    tenant_id="replace_with_tenant_id",
    client_id="replace_with_client_id",
    client_secret="replace_with_client_secret"
)

client = PurviewClient(
    account_name="PurviewAccountName",
    authentication=auth
)

# Parse the ColumnMapping tab into Process entities and upload them.
ec = ExcelConfiguration()
reader = ExcelReader(ec)
entities = reader.parse_column_mapping("path/to/spreadsheet.xlsx")
results = client.upload_entities(entities)

# Pretty print the upload results.
print(json.dumps(results, indent=2))
```
Assuming you've authenticated properly, you will have uploaded one entity to Purview!
What happened in this script?
* For authentication, you could also install azure-identity and use the AzureCliCredential for even less code having to be written.
* parse_column_mapping reads the ColumnMapping tab and converts it into Process entities, and upload_entities does the actual upload to your Atlas or Purview service.
* json.dumps takes a Python dictionary and turns it into a string. The indent=2 tells Python to add two spaces for each level in the resulting json.

If you're defining brand new lineage, you might use the Purview custom lineage Excel feature and the UpdateLineage tab to define inputs and outputs at the table level. The ColumnMapping tab defines the Purview data structure for activating the Column Mapping feature in the Lineage tab.
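For reference, the columnMapping attribute that ends up on the Process entity is a JSON-encoded string. The sketch below shows roughly the shape produced for the table above, assuming Purview's DatasetMapping / ColumnMapping convention with Source and Sink keys; treat the exact key names as an assumption to verify against the Purview docs.

```python
import json

# Roughly the structure Purview's Lineage UI reads from the columnMapping
# attribute (stored as a JSON string on the Process entity).
column_mapping = [
    {
        "DatasetMapping": {
            "Source": "custom://source-that-exists",
            "Sink": "custom://target-that-exists"
        },
        "ColumnMapping": [
            {"Source": "colA", "Sink": "destcolA"},
            {"Source": "colB", "Sink": "destcolB"},
            {"Source": "colC", "Sink": "destcolCombo"},
            {"Source": "colD", "Sink": "destcolCombo"}
        ]
    }
]

columnMapping_attribute = json.dumps(column_mapping)
```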
You can combine the results of both tabs easily by calling the reader.parse_update_lineage_with_mappings(filepath) method instead. This will take the results from the UpdateLineage tab and apply the columnMapping attribute to those entities.
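A short sketch of that variant, reusing the reader and client from the snippet above:

```python
# Parses both the UpdateLineage and ColumnMapping tabs and attaches the
# column mappings to the resulting Process entities before uploading.
entities = reader.parse_update_lineage_with_mappings("path/to/spreadsheet.xlsx")
results = client.upload_entities(entities)
print(json.dumps(results, indent=2))
```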
* Use the ExcelReader.parse_column_mapping(filepath) method to extract the Process entities that would be uploaded with PurviewClient.upload_entities.
* Use the ExcelReader.parse_update_lineage_with_mappings(filepath) method to extract Process entities from both the UpdateLineage and ColumnMapping tabs.