
Installing and Authenticating with PyApacheAtlas

If you're interested in creating custom types, creating custom lineage, or building a custom connector, you'll likely want to use PyApacheAtlas to support your application feeding metadata to Azure Purview or Apache Atlas.

This guide will show you the steps for installing and configuring PyApacheAtlas.

To work with PyApacheAtlas, you will install Python, install PyApacheAtlas and Azure Identity (ideally inside a virtual environment), make sure the Azure CLI is installed, confirm that your identity has the necessary permissions in Purview, and finally run a test script to confirm access.

Installing Python 3

PyApacheAtlas requires Python to be installed on the machine you're using to execute the scripts. Specifically, it requires Python 3.6 or higher. If you are using an Azure PaaS compute service (like Databricks, Synapse, or Azure ML Compute Instances), you likely already have Python installed. If you're using your local machine, you may need to install it first.
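If you are unsure which interpreter a machine is running, a quick check like the one below can confirm it meets the 3.6 minimum mentioned above. This is only a sketch; the names `MINIMUM_VERSION` and `python_is_supported` are made up for illustration and are not part of PyApacheAtlas.

```python
import sys

MINIMUM_VERSION = (3, 6)  # minimum Python version stated in this guide


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM_VERSION):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = "{}.{}".format(*sys.version_info[:2])
    if python_is_supported():
        print("Python " + current + " is new enough for PyApacheAtlas.")
    else:
        print("Python " + current + " is too old; install 3.6 or higher.")
```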

(Optional) Creating a Virtual Environment

If you plan on using Python for other projects, it's a good idea to use a virtual environment which keeps all of the relevant libraries / dependencies related to your project in one spot, isolated from any other project.

With Python installed, run python -m pip install virtualenv to install the utility that creates this isolated environment.

With virtualenv installed globally, you can create a new virtual environment in any folder. Navigate to the folder you plan to work in and run python -m virtualenv env. This creates a folder named env that houses a copy of Python and any libraries you install into it.

To use this virtual environment, you need to activate it: run env\Scripts\activate on Windows, or source env/bin/activate on macOS and Linux.

Your command line will then show (env) in front of your regular prompt.
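Besides looking for the (env) prefix, you can ask Python itself whether it is running inside a virtual environment. The sketch below uses only the standard library; `in_virtual_environment` is a hypothetical helper name, and the check covers both older virtualenv releases (which set `sys.real_prefix`) and newer ones (where `sys.prefix` differs from `sys.base_prefix`).

```python
import sys


def in_virtual_environment():
    """Return True when this interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases instead make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Virtual environment active:", in_virtual_environment())
print("Interpreter prefix:", sys.prefix)
```

When the environment is active, the printed prefix should point inside the env folder you created.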

Pip Install PyApacheAtlas

Now that you have Python installed (and optionally have activated a virtual environment), you can install PyApacheAtlas from PyPI. PyPI is a package index that hosts hundreds of thousands of Python packages. If your target machine can access PyPI, it is a very convenient installation source.

To install on your local machine for the first time, run python -m pip install pyapacheatlas. If you already have it installed and want to upgrade, add the upgrade flag: python -m pip install --upgrade pyapacheatlas. Either command installs PyApacheAtlas along with the dependent libraries it needs to run.

If you plan on using your Azure CLI credentials to work with Purview, you will also need to install azure-identity. Run python -m pip install azure-identity to install the library that lets you use your personal credential to access Purview.
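After running the pip commands, you may want to confirm that both packages are visible to the interpreter you intend to use (for example, the one inside your virtual environment). The following sketch relies only on the standard library's `importlib.util.find_spec`; `is_installed` is a hypothetical helper, and the module names are the ones imported later in this guide.

```python
import importlib.util


def is_installed(module_name):
    """Return True when the import system can locate module_name."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec imports parent packages for dotted names, so a missing
        # parent (for example "azure") surfaces as ModuleNotFoundError.
        return False


for name in ("pyapacheatlas", "azure.identity"):
    status = "installed" if is_installed(name) else "missing"
    print(name + ": " + status)
```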

For Azure PaaS services, please see each service's official docs for installing Python packages.

Some companies don't allow connections to PyPI or require that source code be vetted and stored in a secured area. Consider using Python Artifacts in Azure DevOps to store vetted and pre-built versions of PyApacheAtlas.

Authenticate Against Atlas or Purview

How you authenticate depends on your target: Atlas only supports Basic Authentication with a username and password, while Purview supports Azure Active Directory.

Basic Auth for Apache Atlas

You'll need to collect your username and password as well as the Atlas API endpoint. The code below expects you to have ATLAS_USERNAME and ATLAS_PASSWORD stored as environment variables or hard coded in the script.

import os

from pyapacheatlas.auth import BasicAuthentication

basic_auth = BasicAuthentication(
    username=os.environ.get("ATLAS_USERNAME", "OrHardCodeHere"),
    password=os.environ.get("ATLAS_PASSWORD", "OrHardCodeHere")
)
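The snippet above falls back to a placeholder string when an environment variable is missing, which means a misconfigured machine only fails later at the first API call. If you would rather fail early, one option is a small helper like the sketch below. `require_env` is a hypothetical function, not part of PyApacheAtlas, and the optional `environ` argument exists only so the example can run without touching your real environment.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "")
    if not value:
        raise RuntimeError("Environment variable " + name + " is not set.")
    return value


# Example with an in-memory mapping standing in for os.environ.
example_env = {"ATLAS_USERNAME": "admin"}
print(require_env("ATLAS_USERNAME", example_env))
```

In the authentication code, `require_env("ATLAS_USERNAME")` could then take the place of `os.environ.get("ATLAS_USERNAME", "OrHardCodeHere")`.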

Service Principal Auth

You'll need to collect your service principal's tenant id (a.k.a. directory id), client id, and client secret, as well as the name of your Purview instance. The code below expects AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, and PURVIEW_NAME to be set as environment variables or hard coded in the script. See the Purview official docs for more information on creating a service principal and collecting this information.

import os

from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.client import PurviewClient


# If you plan on using a service principal
oauth = ServicePrincipalAuthentication(
    tenant_id=os.environ.get("AZURE_TENANT_ID", "OrHardCodeHere"),
    client_id=os.environ.get("AZURE_CLIENT_ID", "OrHardCodeHere"),
    client_secret=os.environ.get("AZURE_CLIENT_SECRET", "OrHardCodeHere")
)
client = PurviewClient(
    account_name=os.environ.get("PURVIEW_NAME", "InsertDefaultHere"),
    authentication=oauth
)

Azure-Identity

The azure-identity package is the quickest way to get started with Purview. In this case, you need only your Purview instance name, either as an environment variable or hard coded in the script.

The connection below assumes that you have the Azure CLI installed, that you are logged into the relevant subscription, and that your own identity holds the Purview Data Curator and Collection Admin roles.

import os

from pyapacheatlas.core.client import PurviewClient

# If you plan on using azure-identity
from azure.identity import AzureCliCredential
credential = AzureCliCredential()
client = PurviewClient(
    account_name=os.environ.get("PURVIEW_NAME", "InsertDefaultHere"),
    authentication=credential
)

Azure Role Based Access Control

For those using Azure Purview, the critical roles for the service principal or user executing API scripts are Purview Data Curator and Collection Admin. Together they provide the permissions needed to create, edit, or delete any asset in Purview. See the official docs on how to assign roles to a given identity.

Confirm Connectivity

With everything set up, it's time to execute a script and confirm that everything works as expected. Review the code below and comment out the authentication methods that don't apply to you. The end of the script prints all of the type definitions available by default in Purview or Apache Atlas. The output should contain five types of definitions inside a JSON object. If that succeeds, you've completed all the steps successfully! If it fails, see the F.A.Q. below.

import json
import os

from pyapacheatlas.auth import ServicePrincipalAuthentication, BasicAuthentication
from pyapacheatlas.core.client import PurviewClient, AtlasClient

# If you plan on using azure-identity
from azure.identity import AzureCliCredential
credential = AzureCliCredential()
client = PurviewClient(
    account_name=os.environ.get("PURVIEW_NAME", "InsertDefaultHere"),
    authentication=credential
)

# If you plan on using a service principal
oauth = ServicePrincipalAuthentication(
    tenant_id=os.environ.get("AZURE_TENANT_ID", "OrHardCodeHere"),
    client_id=os.environ.get("AZURE_CLIENT_ID", "OrHardCodeHere"),
    client_secret=os.environ.get("AZURE_CLIENT_SECRET", "OrHardCodeHere")
)
client = PurviewClient(
    account_name=os.environ.get("PURVIEW_NAME", "InsertDefaultHere"),
    authentication=oauth
)

# If you plan on using Basic Auth for Apache Atlas
basic_auth = BasicAuthentication(
    username=os.environ.get("ATLAS_USERNAME", "OrHardCodeHere"),
    password=os.environ.get("ATLAS_PASSWORD", "OrHardCodeHere")
)
client = AtlasClient(
    endpoint_url="https://yourendpoint.url/atlas/api/v2/",
    authentication=basic_auth
)

results = client.get_all_typedefs()
print(json.dumps(results, indent=2))
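The full JSON dump can be long, so a compact summary is sometimes easier to scan when all you want is confirmation that the call worked. The sketch below counts the entries under each list-valued key of the response. `summarize_typedefs` is a hypothetical helper, and `sample_results` is a made-up, heavily trimmed stand-in for the real response; its key names are an assumption about the response shape, so check them against your own output.

```python
def summarize_typedefs(results):
    """Count the definitions under each list-valued key of a typedefs response."""
    return {
        category: len(definitions)
        for category, definitions in results.items()
        if isinstance(definitions, list)
    }


# Hypothetical, heavily trimmed stand-in for client.get_all_typedefs() output.
sample_results = {
    "enumDefs": [],
    "structDefs": [],
    "classificationDefs": [{"name": "example_classification"}],
    "entityDefs": [{"name": "example_entity_a"}, {"name": "example_entity_b"}],
    "relationshipDefs": [],
}

for category, count in sorted(summarize_typedefs(sample_results).items()):
    print(category + ": " + str(count))
```

With a live connection, you would pass the `results` variable from the script above instead of `sample_results`.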

Frequently Asked Questions