
Creating Custom Purview / Atlas Types in PyApacheAtlas

Custom Type Definitions need attribute definitions, relationship definitions, and a base type.

If you're using Azure Purview or another Apache Atlas flavored data catalog tool, you'll often run into an unsupported data source or tool that you want to catalog. Purview and Atlas provide you with a way to extend the tool with your own custom data source types! Read on to learn more about how you can extend Purview and Atlas with your own custom types.

By the end of this article, you'll feel comfortable writing code to create multiple custom types, define relationships between types, and upload those types to your Purview or Atlas instance.

Defining a Custom Type

Apache Atlas and Azure Purview provide a "type" system for your data catalog. That means each asset you have in your data catalog has a type which dictates what sort of attributes you can record and how it connects to other assets.

For example, a hive_table asset type lets you record attributes like: createTime, lastAccessTime, comment, retention, partitionKeys, tableType, temporary, and more. In addition, hive_table also has connections to hive_column assets and a hive_db asset with "relationship attributes" called columns and db respectively.

The most common type definitions are Entity and Relationship definitions. We'll talk about Relationship type definitions later on but let's dive into the Entity Type.

from pyapacheatlas.core.typedef import EntityTypeDef

ent_def = EntityTypeDef(
  name = "my_custom_type_name",
  superTypes = ["DataSet"]
)

This is the minimum we need to create a custom type with just the bare bones!

However, it does not include any custom attributes or connections to other entities.

Read on to find out how to customize this type even further.

Adding Attributes to Your Custom Type

If you are going to create a custom type to represent an IBM DB2 database, you might want to capture attributes like lastModifiedDate, hostname, port, prodOrNonProd. Each of these attributes will have its own (meta)type associated with it. It must be one of the primitive types, such as string, int, or boolean.

You might also capture a collection of values, such as array<int>.

Lastly an attribute may be of an "Enumeration" type which has a restricted set of values that are allowed such as "ACTIVE", "DELETED", or "PURGED".
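To make the enumeration idea concrete, here is a minimal sketch of an enum type definition built as a plain Python dictionary. It assumes the standard Atlas typedef JSON layout (a category of ENUM and a list of elementDefs, each with a value and an ordinal). The helper make_enum_def and the names my_custom_status and status are hypothetical and only for illustration.

```python
import json


def make_enum_def(name, values):
    """Build an Atlas-style enum type definition as a plain dict (hypothetical helper)."""
    return {
        "category": "ENUM",
        "name": name,
        # Each allowed value gets a position (ordinal) in the enumeration.
        "elementDefs": [
            {"value": value, "ordinal": ordinal}
            for ordinal, value in enumerate(values)
        ],
    }


status_enum = make_enum_def("my_custom_status", ["ACTIVE", "DELETED", "PURGED"])
print(json.dumps(status_enum, indent=2))

# An attribute uses the enum by naming it as the attribute's typeName.
status_attribute = {
    "name": "status",
    "typeName": status_enum["name"],
    "cardinality": "SINGLE",
    "isOptional": True,
}
```

An attribute typed this way would only accept one of the three listed values.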

DataSet vs Process

Columns have a required type. Process has inputs and outputs.

For your custom type, you'll also need to choose what you are "inheriting from". This works just like object-oriented inheritance in a programming language: the parent type's attributes are passed down to its child types.

If we have a type called "Asset" which has attributes called "name" and "description" we can inherit all of those attributes by making "Asset" our custom type's "superType". In addition we can take advantage of "multiple inheritance" and bring in the attributes of multiple types!
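The inheritance idea can be sketched with a toy model in plain Python. This is not how Atlas or Purview resolve types internally; it only illustrates how a child type ends up with the attributes of every superType. The "Auditable" type and its lastReviewedBy attribute are made up for the example.

```python
# Toy registry: each type lists its superTypes and the attributes it declares itself.
TYPE_REGISTRY = {
    "Asset": {"superTypes": [], "attributes": ["name", "description"]},
    "Auditable": {"superTypes": [], "attributes": ["lastReviewedBy"]},  # hypothetical type
    "my_custom_type_name": {
        "superTypes": ["Asset", "Auditable"],  # multiple inheritance
        "attributes": ["hostname"],
    },
}


def resolve_attributes(type_name, registry=TYPE_REGISTRY):
    """Collect attributes inherited from every superType, then the type's own."""
    type_def = registry[type_name]
    resolved = []
    for parent in type_def["superTypes"]:
        for attr in resolve_attributes(parent, registry):
            if attr not in resolved:
                resolved.append(attr)
    for attr in type_def["attributes"]:
        if attr not in resolved:
            resolved.append(attr)
    return resolved


print(resolve_attributes("my_custom_type_name"))
# ['name', 'description', 'lastReviewedBy', 'hostname']
```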

You probably don't need to get this advanced as the three most common types you'll use when creating custom types are:

If you're not sure what you want to work with, assume DataSet as your superType.

Deciding on attributes for your type

Having a custom type wouldn't be very interesting unless it enabled you to record information about that type! Consider a database table's column. What might you want to collect about that column?

Each of these must be included as attribute definitions inside your type definition. For each attribute you need to make decisions such as:

PyApacheAtlas provides some smart defaults:

* cardinality is single
* type is a string
* isOptional is set to True (not required).

You can define an attribute with the AtlasAttributeDef class and pass it to the attributes parameter in the EntityTypeDef. Here is a simple and complex example.

from pyapacheatlas.core.typedef import AtlasAttributeDef, EntityTypeDef

ent_def = EntityTypeDef(
  name = "my_custom_type_name",
  superTypes = ["DataSet"],
  attributes = [
    # Simple: an optional string attribute (these match the defaults)
    AtlasAttributeDef(
      name="someAttribute", typeName="string",
      isOptional=True),
    # Complex: an optional set of at most five integers
    AtlasAttributeDef(
      name="someIntList", typeName="array<int>",
      isOptional=True, cardinality="SET",
      valuesMaxCount = 5)
  ]
)

Each of these defaults can be overridden if necessary, but the majority of users just capture a single string that they want to store.
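As a rough illustration of defaults and overrides, the sketch below builds attribute definitions as plain dictionaries. It is a stand-in written for this article, not PyApacheAtlas's implementation; make_attribute_def is a hypothetical helper, and the key names simply follow the parameters used in the example above.

```python
def make_attribute_def(name, typeName="string", cardinality="SINGLE",
                       isOptional=True, **overrides):
    """Return an attribute definition dict, filling in the defaults described above."""
    attribute = {
        "name": name,
        "typeName": typeName,
        "cardinality": cardinality,
        "isOptional": isOptional,
    }
    # Anything extra (for example valuesMaxCount) is added as-is.
    attribute.update(overrides)
    return attribute


# Only the name is given, so every default applies.
simple = make_attribute_def("someAttribute")

# Override the type and cardinality, and add a limit on the number of values.
int_list = make_attribute_def(
    "someIntList", typeName="array<int>", cardinality="SET", valuesMaxCount=5
)

print(simple)
print(int_list)
```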

Creating relationships between types

A server type might have a relationship to a database which has a relationship to a schema which has a relationship to a table which has a relationship to a column.

Assuming your custom type is not an isolated, standalone thing, you likely want to connect different entities together through "relationships" in Purview / Atlas.

Every relationship between two entities is its own type! You need to define a RelationshipTypeDef with two "end definitions". A Relationship Type Definition also includes a "Relationship Category" which is set to one of the following:

* COMPOSITION: A parent that contains children and the children should not exist without the parent (think a database table as the parent and the columns as the children that must have a table). This is the most common relationship category.
* AGGREGATION: A parent that contains children and the children COULD exist without the parent.
* ASSOCIATION: Two entities are connected but neither one is a parent or child. This is the least common relationship category.
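One way to keep the three categories straight is to look at how many ends act as a "container" (the isContainer flag that appears in the manual end definitions later in this article). The check below is a small sketch of that rule as described here, assuming a parent/child relationship has exactly one container end and an association has none. check_container_ends is a hypothetical helper, not part of PyApacheAtlas or Atlas.

```python
def check_container_ends(relationship_category, end1_is_container, end2_is_container):
    """Return True when the isContainer flags fit the relationship category."""
    container_count = int(end1_is_container) + int(end2_is_container)
    if relationship_category in ("COMPOSITION", "AGGREGATION"):
        # Parent/child relationships: exactly one end is the parent (container).
        return container_count == 1
    if relationship_category == "ASSOCIATION":
        # Peers: neither end is a parent.
        return container_count == 0
    raise ValueError(f"Unknown relationship category: {relationship_category}")


print(check_container_ends("COMPOSITION", True, False))  # True
print(check_container_ends("COMPOSITION", False, False))  # False
print(check_container_ends("ASSOCIATION", False, False))  # True
```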

The start of our relationship definition will look like this:

from pyapacheatlas.core.typedef import RelationshipTypeDef

rel_def = RelationshipTypeDef(
  name = "my_relationship_type_name",
  endDef1 = {}, # To be filled in
  endDef2 = {}, # To be filled in
  relationshipCategory = "COMPOSITION"
)

With the category defined, you'll need to think about the two "end definitions". I like to think of "endDef1" as the parent and "endDef2" as the child.

The parent end definition would be considered a "container" of other entities. For example, a custom table type would have a relationship attribute that contains a set of column entities, whereas the column entity would have a relationship attribute that points to a "single" table.

The easiest way of doing this in PyApacheAtlas is with the helper classes called ParentEndDef and ChildEndDef. If the defaults presented here are too restrictive, you can use the raw AtlasRelationshipEndDef class instead.

from pyapacheatlas.core.typedef import ParentEndDef
from pyapacheatlas.core.typedef import ChildEndDef

parent = ParentEndDef(
  name="nameOfAttributeContainingColumns",
  typeName="someParentTableType",
  description = "This is the parent end"
)

child = ChildEndDef(
  name="nameOfAttributePointingBackToParentTable",
  typeName="someColumnType",
  description = "This is the child end"
)

Alternatively, you could define the end definitions manually as dictionaries. It's just more verbose, and you need to know which values to plug in for every other field.

parent = {
  "cardinality" : "SET",
  "description" : "This is the parent end",
  "isContainer" : True, # The parent end is the container of the children
  "isLegacyAttribute" : False,
  "name" : "nameOfAttributeContainingColumns",
  "type" : "someParentTableType"
}
child = {
  "cardinality" : "SINGLE",
  "description" : "This is the child end",
  "isContainer" : False,
  "isLegacyAttribute" : False,
  "name" : "nameOfAttributePointingBackToParentTable",
  "type" : "someColumnType"
}
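For the manual route, the two end dictionaries slot into a relationship definition that can itself be written as a plain dictionary. The sketch below assumes the Atlas typedef JSON layout, where the keys endDef1, endDef2, and relationshipCategory mirror the RelationshipTypeDef parameters used in this article and category is set to RELATIONSHIP. The variable names are only for illustration.

```python
import json

# Parent end: holds a SET of children and is marked as the container.
parent_end = {
    "cardinality": "SET",
    "description": "This is the parent end",
    "isContainer": True,
    "isLegacyAttribute": False,
    "name": "nameOfAttributeContainingColumns",
    "type": "someParentTableType",
}

# Child end: points back to a SINGLE parent.
child_end = {
    "cardinality": "SINGLE",
    "description": "This is the child end",
    "isContainer": False,
    "isLegacyAttribute": False,
    "name": "nameOfAttributePointingBackToParentTable",
    "type": "someColumnType",
}

relationship_payload = {
    "category": "RELATIONSHIP",
    "name": "my_relationship_type_name",
    "endDef1": parent_end,
    "endDef2": child_end,
    "relationshipCategory": "COMPOSITION",
}

print(json.dumps(relationship_payload, indent=2))
```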

The complete script in PyApacheAtlas would look like this:

from pyapacheatlas.core.typedef import RelationshipTypeDef
from pyapacheatlas.core.typedef import ParentEndDef
from pyapacheatlas.core.typedef import ChildEndDef

parent = ParentEndDef(
  name="nameOfAttributeContainingColumns",
  typeName="someParentTableType",
  description = "This is the parent end"
)

child = ChildEndDef(
  name="nameOfAttributePointingBackToParentTable",
  typeName="someColumnType",
  description = "This is the child end"
)

rel_def = RelationshipTypeDef(
  name = "my_relationship_type_name",
  endDef1 = parent,
  endDef2 = child,
  relationshipCategory = "COMPOSITION"
)

Uploading Type Definitions

With our relationship type definition created, we need to actually upload the type definition.

After having authenticated with PyApacheAtlas you can upload using the AtlasClient.upload_typedefs method.

Create a client, authenticate, and call the method. Here's an example using Purview and the entity and relationship definitions defined earlier. Notice how we have parameters for each type category: entityDefs and relationshipDefs each receive a list of the appropriate type definitions.

import json

from pyapacheatlas.core import PurviewClient
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.typedef import ChildEndDef, ParentEndDef, RelationshipTypeDef

auth = ServicePrincipalAuthentication(
  tenant_id = "...",
  client_id = "...",
  client_secret = "..."
)
client = PurviewClient(
  account_name = "PurviewAccountName",
  authentication = auth
)

results = client.upload_typedefs(
  entityDefs = [ent_def],
  relationshipDefs = [rel_def]
)
print(json.dumps(results, indent=2))
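The printed response can be long, so it may help to reduce it to just the type names. The sketch below assumes the response is a dictionary keyed by type category (entityDefs, relationshipDefs, and so on), each holding a list of definitions with a name field. The sample_results value is hand-written for illustration and is not real service output, and summarize_typedef_response is a hypothetical helper.

```python
def summarize_typedef_response(results):
    """Map each non-empty '...Defs' key to the names of the type definitions it holds."""
    summary = {}
    for key, defs in results.items():
        if key.endswith("Defs") and defs:
            summary[key] = [type_def.get("name") for type_def in defs]
    return summary


# Hand-written stand-in for the dictionary returned by the upload call.
sample_results = {
    "enumDefs": [],
    "entityDefs": [{"category": "ENTITY", "name": "my_custom_type_name"}],
    "relationshipDefs": [
        {"category": "RELATIONSHIP", "name": "my_relationship_type_name"}
    ],
}

print(summarize_typedef_response(sample_results))
# {'entityDefs': ['my_custom_type_name'], 'relationshipDefs': ['my_relationship_type_name']}
```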

And there you have it! You've now defined an entity type and potentially created a relationship between multiple entities.