If you're using Azure Purview or another Apache Atlas-flavored data catalog tool, you'll often run into an unsupported data source or tool that you want to catalog. Purview and Atlas provide you with a way to extend the tool with your own custom data source types! Read on to learn more about how you can extend Purview and Atlas with your own custom types.
By the end of this article, you'll feel comfortable writing code to create multiple custom types, define relationships between types, and upload those types to your Purview or Atlas instance.
Apache Atlas and Azure Purview provide a "type" system for your data catalog. That means each asset you have in your data catalog has a type which dictates what sort of attributes you can record and how it connects to other assets.
For example, a hive_table asset type lets you record attributes like: createTime, lastAccessTime, comment, retention, partitionKeys, tableType, temporary, and more. In addition, hive_table also has connections to hive_column assets and a hive_db asset with "relationship attributes" called columns and db respectively.
The most common type definitions are Entity and Relationship definitions. We'll talk about Relationship type definitions later on but let's dive into the Entity Type.
from pyapacheatlas.core.typedef import EntityTypeDef

# The smallest useful definition: a name plus the type it inherits from.
ent_def = EntityTypeDef(
    name="my_custom_type_name",
    superTypes=["DataSet"]
)
This is the minimum we need to create a custom type with just the bare bones!
However, it does not include any custom attributes or connections to other entities.
Read on to find out how to customize this type even further.
If you are going to create a custom type to represent an IBM DB2 database, you might want to capture attributes like lastModifiedDate, hostname, port, prodOrNonProd. Each of these attributes will have its own (meta)type associated with it. It can be one of the primitive types, such as string, int, boolean, or date.
You might also introduce a collection of values that could be captured:
* array<string> for a list of strings being stored.
* map<string, string> for a dictionary of string keys storing string values.
Lastly, an attribute may be of an "Enumeration" type, which has a restricted set of values that are allowed, such as "ACTIVE", "DELETED", or "PURGED".
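To make the three kinds of attribute types concrete, here is a minimal sketch that uses plain Python dictionaries rather than PyApacheAtlas classes. The helper functions, the enum name db2_status, and the aliases, tags, and status attributes are hypothetical and exist only for illustration. The elementDefs layout is my understanding of how Atlas describes enum definitions in JSON, so treat it as an assumption.

```python
def array_of(element_type):
    """Build the type name for a list of values, e.g. array<string>."""
    return f"array<{element_type}>"


def map_of(key_type, value_type):
    """Build the type name for a dictionary, e.g. map<string, string>."""
    return f"map<{key_type}, {value_type}>"


def enum_def(name, values):
    """Build an enum type definition with a restricted set of allowed values."""
    return {
        "category": "ENUM",
        "name": name,
        "elementDefs": [
            {"value": value, "ordinal": position}
            for position, value in enumerate(values)
        ],
    }


# Hypothetical enum for the DB2 example above.
status_enum = enum_def("db2_status", ["ACTIVE", "DELETED", "PURGED"])

# Attribute name -> type name. An enum attribute refers to the enum by name.
attribute_types = {
    "hostname": "string",
    "port": "int",
    "lastModifiedDate": "date",
    "aliases": array_of("string"),        # hypothetical attribute
    "tags": map_of("string", "string"),   # hypothetical attribute
    "status": status_enum["name"],        # hypothetical attribute
}
```

The point of the sketch is that collection types are just specially formatted type names, while an enumeration is a separate type definition that attributes then reference by name.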
For your custom type, you'll also need to choose what you are "inheriting from". In programming terms, this is just like object-oriented inheritance, where the parent's attributes are passed down to the child classes.
If we have a type called "Asset" which has attributes called "name" and "description" we can inherit all of those attributes by making "Asset" our custom type's "superType". In addition we can take advantage of "multiple inheritance" and bring in the attributes of multiple types!
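As a rough mental model of how superTypes pass attributes down, the sketch below resolves the full attribute list for a type from a toy registry. This is not how Atlas or Purview implement inheritance internally. The registry, the Auditable type, and the lastReviewedBy and hostname attributes are made up for illustration.

```python
# Hypothetical registry: type name -> its superTypes and the attributes it declares itself.
TYPE_REGISTRY = {
    "Asset": {"superTypes": [], "attributes": ["name", "description"]},
    "Auditable": {"superTypes": [], "attributes": ["lastReviewedBy"]},
    "my_custom_type_name": {
        "superTypes": ["Asset", "Auditable"],  # multiple inheritance
        "attributes": ["hostname"],
    },
}


def all_attributes(type_name, registry=TYPE_REGISTRY):
    """Collect inherited attributes first, then the type's own, without duplicates."""
    collected = []
    for parent in registry[type_name]["superTypes"]:
        for attribute in all_attributes(parent, registry):
            if attribute not in collected:
                collected.append(attribute)
    for attribute in registry[type_name]["attributes"]:
        if attribute not in collected:
            collected.append(attribute)
    return collected


print(all_attributes("my_custom_type_name"))
# ['name', 'description', 'lastReviewedBy', 'hostname']
```

The custom type only declares hostname itself, yet it ends up with the attributes of both of its superTypes.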
You probably don't need to get this advanced, as the three most common types you'll use when creating custom types are:
* DataSet, which is used to represent things like tables, files, reports, or even databases and servers.
* column, which represents a column or field inside a table, file, or report.
* Process, which allows for creating custom lineage between DataSet entities.
If you're not sure what you want to work with, assume DataSet as your superType.
Having a custom type wouldn't be very interesting unless it enabled you to record information about that type! Consider a database table's column. What might you want to collect about that column?
Each of these must be included as attribute definitions inside your type definition. For each attribute you need to make decisions such as its type, whether it holds a single value or many (its cardinality), and whether it is optional.
PyApacheAtlas provides some smart defaults:
* cardinality is single
* type is a string
* isOptional is set to True (not required).
You can define an attribute with the AtlasAttributeDef class and pass it to the attributes parameter in the EntityTypeDef. Here is a simple and complex example.
from pyapacheatlas.core.typedef import AtlasAttributeDef, EntityTypeDef

ent_def = EntityTypeDef(
    name="my_custom_type_name",
    superTypes=["DataSet"],
    attributes=[
        # Simple: a single optional string.
        AtlasAttributeDef(
            name="someAttribute", typeName="string",
            isOptional=True),
        # Complex: an optional set of up to five integers.
        AtlasAttributeDef(
            name="someIntList", typeName="array<int>",
            isOptional=True, cardinality="SET",
            valuesMaxCount=5)
    ]
)
Each of the defaults can be overridden if necessary, but the majority of users tend to just capture a single string that they want to store.
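The default-then-override behavior can be sketched with plain dictionaries. The attribute_def helper below is hypothetical and is not part of PyApacheAtlas; it only mirrors the three defaults listed above (single cardinality, string type, optional) and lets keyword arguments replace them.

```python
# Defaults taken from the list above; the helper itself is a hypothetical stand-in.
DEFAULTS = {"typeName": "string", "cardinality": "SINGLE", "isOptional": True}


def attribute_def(name, **overrides):
    """Start from the defaults, then apply any caller-supplied overrides."""
    definition = {"name": name, **DEFAULTS}
    definition.update(overrides)
    return definition


# Relies entirely on the defaults.
simple_attr = attribute_def("someAttribute")

# Overrides the type and cardinality and adds an extra constraint.
complex_attr = attribute_def(
    "someIntList",
    typeName="array<int>",
    cardinality="SET",
    valuesMaxCount=5,
)
```

Most attributes need nothing more than a name, which is why the defaults lean toward a single optional string.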
Assuming your custom type is not an isolated, standalone thing, you likely want to connect different entities together through "relationships" in Purview / Atlas.
Every relationship between two entities is its own type! You need to define a RelationshipTypeDef with two "end definitions". A Relationship Type Definition also includes a "Relationship Category" which is set to one of the following:
* COMPOSITION: A parent that contains children and the children should not exist without the parent (think a database table as the parent and the columns as the children that must have a table). This is the most common relationship category.
* AGGREGATION: A parent that contains children and the children COULD exist without the parent.
* ASSOCIATION: Two entities are connected but neither one is a parent or child. This is the least common relationship category.
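One way to keep the three categories straight is to encode the two questions they answer: is there a parent, and does the child depend on it? The enum below is a hypothetical teaching aid and not a PyApacheAtlas class. Only the three category names come from Atlas; the property names are my own.

```python
from enum import Enum


class RelationshipCategory(Enum):
    COMPOSITION = "COMPOSITION"
    AGGREGATION = "AGGREGATION"
    ASSOCIATION = "ASSOCIATION"

    @property
    def has_parent(self):
        """True when one end acts as a parent that contains the other."""
        return self is not RelationshipCategory.ASSOCIATION

    @property
    def child_needs_parent(self):
        """True when children should not exist without their parent."""
        return self is RelationshipCategory.COMPOSITION


for category in RelationshipCategory:
    print(category.value, category.has_parent, category.child_needs_parent)
# COMPOSITION True True
# AGGREGATION True False
# ASSOCIATION False False
```

A table and its columns answer "yes" to both questions, which is why that pairing is modeled as a COMPOSITION.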
The start of our relationship definition will look like this:
from pyapacheatlas.core.typedef import RelationshipTypeDef

rel_def = RelationshipTypeDef(
    name="my_relationship_type_name",
    endDef1={},  # To be filled in
    endDef2={},  # To be filled in
    relationshipCategory="COMPOSITION"
)
With the category defined, you'll need to think about the two "end definitions". I like to think of "endDef1" as the parent and "endDef2" as the child.
The parent end definition would be considered a "container" of other entities. For example, a custom table type would have a relationship attribute that contains a set of column entities, whereas the column entity would have a relationship attribute that points to a "single" table.
The easiest way of doing this in PyApacheAtlas is with the helper classes called ParentEndDef and ChildEndDef. If the defaults presented here are too restrictive, you can use the raw AtlasRelationshipEndDef class instead.
from pyapacheatlas.core.typedef import ParentEndDef
from pyapacheatlas.core.typedef import ChildEndDef

# The parent end: the attribute on the table that holds its columns.
parent = ParentEndDef(
    name="nameOfAttributeContainingColumns",
    typeName="someParentTableType",
    description="This is the parent end"
)
# The child end: the attribute on the column that points back to its table.
child = ChildEndDef(
    name="nameOfAttributePointingBackToParentTable",
    typeName="someColumnType",
    description="This is the child end"
)
Alternatively, you could define this manually. It's just more verbose, and you need to know the values to plug in for everything else.
parent = {
    "cardinality": "SET",
    "description": "This is the parent end",
    "isContainer": True,  # the parent end is the container of the children
    "isLegacyAttribute": False,
    "name": "nameOfAttributeContainingColumns",
    "type": "someParentTableType"
}
child = {
    "cardinality": "SINGLE",
    "description": "This is the child end",
    "isContainer": False,
    "isLegacyAttribute": False,
    "name": "nameOfAttributePointingBackToParentTable",
    "type": "someColumnType"
}
The complete script in PyApacheAtlas would look like this:
from pyapacheatlas.core.typedef import RelationshipTypeDef
from pyapacheatlas.core.typedef import ParentEndDef
from pyapacheatlas.core.typedef import ChildEndDef

parent = ParentEndDef(
    name="nameOfAttributeContainingColumns",
    typeName="someParentTableType",
    description="This is the parent end"
)
child = ChildEndDef(
    name="nameOfAttributePointingBackToParentTable",
    typeName="someColumnType",
    description="This is the child end"
)
# endDef1 is the parent (container) end and endDef2 is the child end.
rel_def = RelationshipTypeDef(
    name="my_relationship_type_name",
    endDef1=parent,
    endDef2=child,
    relationshipCategory="COMPOSITION"
)
With our relationship type definition created, we need to actually upload the type definition.
After authenticating with PyApacheAtlas, you can upload using the AtlasClient.upload_typedefs method.
Create a client, authenticate, and call the method. Here's an example using Purview and the entity and relationship definitions defined earlier. Notice how we have parameters for each type category: entityDefs and relationshipDefs each receive a list of the appropriate type definitions.
import json

from pyapacheatlas.core import PurviewClient
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core.typedef import ChildEndDef, ParentEndDef, RelationshipTypeDef

# Authenticate with a service principal that has access to the Purview account.
auth = ServicePrincipalAuthentication(
    tenant_id="...",
    client_id="...",
    client_secret="..."
)
client = PurviewClient(
    account_name="PurviewAccountName",
    authentication=auth
)
# ent_def and rel_def are the definitions created earlier in this article.
results = client.upload_typedefs(
    entityDefs=[ent_def],
    relationshipDefs=[rel_def]
)
print(json.dumps(results, indent=2))
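If you'd like to see what is being grouped together before anything is sent, the sketch below assembles a comparable payload from plain dictionaries and prints it without making any network call. The build_typedefs_payload helper is hypothetical, the end definitions are abbreviated, and the exact JSON that PyApacheAtlas sends is an assumption here: I am only mirroring the entityDefs and relationshipDefs parameter names used above.

```python
import json


def build_typedefs_payload(entity_defs=None, relationship_defs=None):
    """Group type definitions by category, one list per category."""
    return {
        "entityDefs": list(entity_defs or []),
        "relationshipDefs": list(relationship_defs or []),
    }


# Plain-dict stand-ins for the entity and relationship definitions in this article.
entity = {
    "category": "ENTITY",
    "name": "my_custom_type_name",
    "superTypes": ["DataSet"],
    "attributeDefs": [],
}
relationship = {
    "category": "RELATIONSHIP",
    "name": "my_relationship_type_name",
    "relationshipCategory": "COMPOSITION",
    # Abbreviated end definitions; a real one carries more fields.
    "endDef1": {"name": "nameOfAttributeContainingColumns",
                "type": "someParentTableType", "cardinality": "SET"},
    "endDef2": {"name": "nameOfAttributePointingBackToParentTable",
                "type": "someColumnType", "cardinality": "SINGLE"},
}

payload = build_typedefs_payload([entity], [relationship])
print(json.dumps(payload, indent=2))
```

Printing the payload like this is a cheap way to review names and superTypes before running the real upload.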
And there you have it! You've now defined an entity type and potentially created a relationship between multiple entities.