If you're interested in creating custom types, creating custom lineage, or building a custom connector, you'll likely want to use PyApacheAtlas to support your application feeding metadata to Azure Purview or Apache Atlas.
This guide will show you the steps for installing and configuring PyApacheAtlas.
PyApacheAtlas requires Python 3.6 or higher on the machine you use to execute the scripts. If you are using an Azure PaaS compute service (like Databricks, Synapse, or Azure ML Compute Instances), you likely already have Python installed. If you're using your local machine, you may need to install Python first.
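As a quick way to confirm the version requirement above, the following minimal sketch compares the running interpreter against the 3.6 minimum. The helper name meets_minimum is made up for this illustration and is not part of PyApacheAtlas.

```python
import sys

# Minimum version stated in this guide.
MINIMUM_PYTHON = (3, 6)


def meets_minimum(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True when the (major, minor) version is at least `minimum`.

    `version_info` defaults to the running interpreter's version.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


# Report on the interpreter that is running this snippet.
print("Python", sys.version.split()[0], "meets minimum:", meets_minimum())
```

If this prints False, install a newer Python before continuing.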
If you plan on using Python for other projects, it's a good idea to use a virtual environment, which keeps all of the libraries and dependencies for your project in one spot, isolated from any other project.
With Python installed, run the following command to install virtualenv, the utility that creates this isolated environment:
python -m pip install virtualenv
With virtualenv installed among your global Python libraries, you can create a new virtual environment in any folder. Navigate to the folder you plan to work in and run:
python -m virtualenv env
This creates a folder named env that houses a copy of Python and any libraries you install into it.
To use this virtual environment, you need to activate it.
On Windows:
env\Scripts\activate
On macOS or Linux:
source env/bin/activate
Once the environment is activated, your command line shows (env) in front of your regular prompt.
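Besides looking for the (env) prefix, you can check from inside Python whether a virtual environment is active. The sketch below relies on sys.prefix differing from sys.base_prefix inside an environment, with a fallback to the sys.real_prefix attribute that older virtualenv releases set. The function name in_virtual_env is only illustrative.

```python
import sys


def in_virtual_env():
    """Return True when the interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix instead of sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # In a venv or virtualenv, sys.prefix points at the environment folder
    # while sys.base_prefix points at the Python installation it was made from.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return base_prefix != sys.prefix


print("Virtual environment active:", in_virtual_env())
print("Interpreter prefix:", sys.prefix)
```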
Now that you have Python installed (and have optionally activated a virtual environment), you can install PyApacheAtlas from PyPI. PyPI is a package index that hosts hundreds of thousands of Python packages. If your target machine can reach PyPI, it is a very convenient installation source.
To install PyApacheAtlas on your local machine for the first time, run:
python -m pip install pyapacheatlas
If you already have it installed and want to upgrade, add the upgrade flag:
python -m pip install --upgrade pyapacheatlas
Either command installs PyApacheAtlas and the dependent libraries it needs to run.
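To confirm the installation landed in the interpreter you intend to use, a small check like the one below can help. It is a sketch that uses only the standard library, so it runs whether or not the packages are present. The helper name is_installed is an assumption for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True when `module_name` can be located by the current interpreter."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised for dotted names whose parent package is missing.
        return False


# azure.identity is only needed if you plan to use Azure CLI credentials.
for name in ("pyapacheatlas", "azure.identity"):
    status = "installed" if is_installed(name) else "not found"
    print(name + ": " + status)
```

If pyapacheatlas shows as not found after installing, check that the virtual environment you installed into is the one currently activated.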
If you plan on using your Azure CLI credentials to work with Purview, you will also need the azure-identity library, which lets you use your personal credential to access Purview. Install it with:
python -m pip install azure-identity
For Azure PaaS services, please see the official docs for installation of Python packages:
Some companies don't allow connections to PyPI or require that source code be vetted and stored in a secured area. Consider using Python Artifacts in Azure DevOps to store vetted and pre-built versions of PyApacheAtlas.
Depending on your target, you must authenticate in a specific way: Atlas supports only Basic Authentication with a username and password, while Purview supports Azure Active Directory.
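One way to keep a single script usable against either target is to pick the authentication method from whichever environment variables are set. The sketch below is only an illustration of that idea: the function name choose_auth_method and the returned labels are made up here, while the variable names are the ones this guide uses.

```python
import os

# Variable names used elsewhere in this guide.
BASIC_VARS = ("ATLAS_USERNAME", "ATLAS_PASSWORD")
SERVICE_PRINCIPAL_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def choose_auth_method(env=None):
    """Return a label for the authentication method the environment supports.

    The labels ("basic", "service_principal", "azure_cli") are hypothetical
    names for this sketch, not values understood by PyApacheAtlas.
    """
    env = os.environ if env is None else env
    if all(env.get(name) for name in BASIC_VARS):
        return "basic"  # Apache Atlas with username and password
    if all(env.get(name) for name in SERVICE_PRINCIPAL_VARS):
        return "service_principal"  # Purview with a service principal
    return "azure_cli"  # Purview with your Azure CLI login


print("Authentication method:", choose_auth_method())
```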
You'll need to collect your username and password as well as the Atlas API endpoint. The code below expects ATLAS_USERNAME and ATLAS_PASSWORD to be stored as environment variables or hard coded in the script.
import os

from pyapacheatlas.auth import BasicAuthentication

basic_auth = BasicAuthentication(
    username=os.environ.get("ATLAS_USERNAME", "OrHardCodeHere"),
    password=os.environ.get("ATLAS_PASSWORD", "OrHardCodeHere")
)
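The placeholder defaults above mean a missing environment variable only surfaces later as a failed login. If you would rather fail early, a small helper along these lines can replace os.environ.get. The name require_env is hypothetical and not part of PyApacheAtlas.

```python
import os


def require_env(name, env=None):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(
            "Environment variable " + name + " is not set; "
            "export it before running this script."
        )
    return value


# Example usage (commented out so the snippet runs without credentials):
# username = require_env("ATLAS_USERNAME")
# password = require_env("ATLAS_PASSWORD")
```

Reading credentials this way also avoids leaving real secrets hard coded in a script that may end up in source control.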
You'll need to collect your service principal's tenant id (a.k.a. directory id), client id, and client secret, as well as the name of your Purview instance. The code below expects the AZURE_... and PURVIEW_NAME values to be available as environment variables or hard coded in the script. See the Purview official docs for more information on creating a service principal and collecting this information.
import os

from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient

# If you plan on using a service principal
oauth = ServicePrincipalAuthentication(
    tenant_id=os.environ.get("AZURE_TENANT_ID", "OrHardCodeHere"),
    client_id=os.environ.get("AZURE_CLIENT_ID", "OrHardCodeHere"),
    client_secret=os.environ.get("AZURE_CLIENT_SECRET", "OrHardCodeHere")
)
client = PurviewClient(
    account_name=os.environ.get("PURVIEW_NAME", "InsertDefaultHere"),
    authentication=oauth
)
Using the azure-identity package makes it extremely easy to get started with Purview.
In this case, you need only your Purview instance name, either as an environment variable or hard coded in the script.
The connection below assumes you have the Azure CLI installed, are logged into the relevant subscription, and have been granted the Purview Data Curator and Collection Admin roles.
import os

from pyapacheatlas.core.client import PurviewClient

# If you plan on using azure-identity
from azure.identity import AzureCliCredential

credential = AzureCliCredential()
client = PurviewClient(
    account_name=os.environ.get("PURVIEW_NAME", "InsertDefaultHere"),
    authentication=credential
)
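Because this approach depends on the Azure CLI, it can be useful to confirm the az executable is on your PATH before building the client. The sketch below only checks that the executable can be found. It does not verify that you are logged in, and the function name azure_cli_available is an assumption for this example.

```python
import shutil


def azure_cli_available():
    """Return True when an `az` executable can be found on the PATH."""
    return shutil.which("az") is not None


if azure_cli_available():
    print("Azure CLI found. Run 'az login' if you have not signed in yet.")
else:
    print("Azure CLI not found on PATH. Install it before using AzureCliCredential.")
```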
For those using Azure Purview, the critical roles for the service principal or user executing API scripts are the Purview Data Curator role and the Collection Admin role. Together they provide the necessary permissions to create, edit, or delete any asset in Purview. See the official docs on how to assign roles to a given identity.
With everything set up, it's time to execute a script and confirm everything works as expected. Review the code below and comment out the authentication methods that aren't relevant to you. As written, each block reassigns client, so the last block left uncommented is the one that takes effect. The end of the script prints all of the type definitions available by default in Purview or Apache Atlas. If that succeeds, you've gotten through all the steps successfully! If it fails, see the F.A.Q. below.
import json
import os

from pyapacheatlas.auth import ServicePrincipalAuthentication, BasicAuthentication
from pyapacheatlas.core.client import PurviewClient, AtlasClient

# If you plan on using azure-identity
from azure.identity import AzureCliCredential

credential = AzureCliCredential()
client = PurviewClient(
    account_name=os.environ.get("PURVIEW_NAME", "InsertDefaultHere"),
    authentication=credential
)

# If you plan on using a service principal
oauth = ServicePrincipalAuthentication(
    tenant_id=os.environ.get("AZURE_TENANT_ID", "OrHardCodeHere"),
    client_id=os.environ.get("AZURE_CLIENT_ID", "OrHardCodeHere"),
    client_secret=os.environ.get("AZURE_CLIENT_SECRET", "OrHardCodeHere")
)
client = PurviewClient(
    account_name=os.environ.get("PURVIEW_NAME", "InsertDefaultHere"),
    authentication=oauth
)

# If you plan on using Basic Auth for Apache Atlas
basic_auth = BasicAuthentication(
    username=os.environ.get("ATLAS_USERNAME", "OrHardCodeHere"),
    password=os.environ.get("ATLAS_PASSWORD", "OrHardCodeHere")
)
client = AtlasClient(
    endpoint_url="https://yourendpoint.url/atlas/api/v2/",
    authentication=basic_auth
)

# Print every type definition to confirm the connection works
results = client.get_all_typedefs()
print(json.dumps(results, indent=2))
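The full JSON dump can be long, so a short summary is sometimes easier to scan. The sketch below assumes the result is a dictionary that maps each type category to a list of definitions. That shape is an assumption to verify against your own output, and the sample data here is invented purely to make the snippet runnable without a connection.

```python
def summarize_typedefs(typedefs):
    """Count the definitions in each category of a typedefs dictionary.

    Only values that are lists are counted; anything else is ignored.
    """
    return {
        category: len(definitions)
        for category, definitions in typedefs.items()
        if isinstance(definitions, list)
    }


# Invented sample standing in for the `results` variable from the script above.
sample_results = {
    "enumDefs": [{"name": "example_enum"}],
    "entityDefs": [{"name": "example_entity_a"}, {"name": "example_entity_b"}],
}

for category, count in sorted(summarize_typedefs(sample_results).items()):
    print(category + ": " + str(count))
```

In the real script you would pass results in place of sample_results.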