Skip to main content
Version: 0.10.4

Business Glossary

Certified

This plugin pulls business glossary metadata from a yaml-formatted file. An example of one such file is located in the examples directory here.

CLI based Ingestion

Install the Plugin

pip install 'acryl-datahub[datahub-business-glossary]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
type: datahub-business-glossary
config:
# Coordinates
file: /path/to/business_glossary_yaml
enable_auto_id: true # recommended to set to true so datahub will auto-generate guids from your term names

# sink configs if needed

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

FieldDescription
file 
One of string, string(path)
File path or URL to business glossary file to ingest.
enable_auto_id
boolean
Generate guid urns instead of a plaintext path urn with the node/term's hierarchy.
Default: False

Business Glossary File Format

The business glossary source file should be a .yml file with the following top-level keys:

Glossary: the top level keys of the business glossary file

Example Glossary:

version: 1                                              # the version of business glossary file config the config conforms to. Currently the only version released is `1`.
source: DataHub # the source format of the terms. Currently only supports `DataHub`
owners: # owners contains two nested fields
users: # (optional) a list of user IDs
- njones
groups: # (optional) a list of group IDs
- logistics
url: "https://github.com/datahub-project/datahub/" # (optional) external url pointing to where the glossary is defined externally, if applicable
nodes: # list of child **GlossaryNode** objects. See **GlossaryNode** section below
...

GlossaryNode: a container of GlossaryNode and GlossaryTerm objects

Example GlossaryNode:

- name: Shipping # name of the node
description: Provides terms related to the shipping domain # description of the node
owners: # (optional) owners contains 2 nested fields
users: # (optional) a list of user IDs
- njones
groups: # (optional) a list of group IDs
- logistics
nodes: # list of child **GlossaryNode** objects
...
knowledge_links: # (optional) list of **KnowledgeCard** objects
- label: Wiki link for shipping
url: "https://en.wikipedia.org/wiki/Freight_transport"

GlossaryTerm: a term in your business glossary

Example GlossaryTerm:

- name: FullAddress # name of the term
description: A collection of information to give the location of a building or plot of land. # description of the term
owners: # (optional) owners contains 2 nested fields
users: # (optional) a list of user IDs
- njones
groups: # (optional) a list of group IDs
- logistics
term_source: "EXTERNAL" # one of `EXTERNAL` or `INTERNAL`. Whether the term is coming from an external glossary or one defined in your organization.
source_ref: FIBO # (optional) if external, what is the name of the source the glossary term is coming from?
source_url: "https://www.google.com" # (optional) if external, what is the url of the source definition?
inherits: # (optional) list of **GlossaryTerm** that this term inherits from
- Privacy.PII
contains: # (optional) a list of **GlossaryTerm** that this term contains
- Shipping.ZipCode
- Shipping.CountryCode
- Shipping.StreetAddress
custom_properties: # (optional) a map of key/value pairs of arbitrary custom properties
- is_used_for_compliance_tracking: true
knowledge_links: # (optional) a list of **KnowledgeCard** related to this term. These appear as links on the glossary node's page
- url: "https://en.wikipedia.org/wiki/Address"
label: Wiki link
domain: "urn:li:domain:Logistics" # (optional) domain name or domain urn

To see how these all work together, check out this comprehensive example business glossary file below:

Example business glossary file
version: 1
source: DataHub
owners:
users:
- mjames
url: "https://github.com/datahub-project/datahub/"
nodes:
- name: Classification
description: A set of terms related to Data Classification
knowledge_links:
- label: Wiki link for classification
url: "https://en.wikipedia.org/wiki/Classification"
terms:
- name: Sensitive
description: Sensitive Data
custom_properties:
is_confidential: false
- name: Confidential
description: Confidential Data
custom_properties:
is_confidential: true
- name: HighlyConfidential
description: Highly Confidential Data
custom_properties:
is_confidential: true
domain: Marketing
- name: PersonalInformation
description: All terms related to personal information
owners:
users:
- mjames
terms:
- name: Email
## An example of using an id to pin a term to a specific guid
## See "how to generate custom IDs for your terms" section below
# id: "urn:li:glossaryTerm:41516e310acbfd9076fffc2c98d2d1a3"
description: An individual's email address
inherits:
- Classification.Confidential
owners:
groups:
- Trust and Safety
- name: Address
description: A physical address
- name: Gender
description: The gender identity of the individual
inherits:
- Classification.Sensitive
- name: Shipping
description: Provides terms related to the shipping domain
owners:
users:
- njones
groups:
- logistics
terms:
- name: FullAddress
description: A collection of information to give the location of a building or plot of land.
owners:
users:
- njones
groups:
- logistics
term_source: "EXTERNAL"
source_ref: FIBO
source_url: "https://www.google.com"
inherits:
- Privacy.PII
contains:
- Shipping.ZipCode
- Shipping.CountryCode
- Shipping.StreetAddress
related_terms:
- Housing.Kitchen.Cutlery
custom_properties:
- is_used_for_compliance_tracking: true
knowledge_links:
- url: "https://en.wikipedia.org/wiki/Address"
label: Wiki link
domain: "urn:li:domain:Logistics"
knowledge_links:
- label: Wiki link for shipping
url: "https://en.wikipedia.org/wiki/Freight_transport"
- name: ClientsAndAccounts
description: Provides basic concepts such as account, account holder, account provider, relationship manager that are commonly used by financial services providers to describe customers and to determine counterparty identities
owners:
groups:
- finance
terms:
- name: Account
description: Container for records associated with a business arrangement for regular transactions and services
term_source: "EXTERNAL"
source_ref: FIBO
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
inherits:
- Classification.HighlyConfidential
contains:
- ClientsAndAccounts.Balance
- name: Balance
description: Amount of money available or owed
term_source: "EXTERNAL"
source_ref: FIBO
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Balance"
- name: Housing
description: Provides terms related to the housing domain
owners:
users:
- mjames
groups:
- interior
nodes:
- name: Colors
description: "Colors that are used in Housing construction"
terms:
- name: Red
description: "red color"
term_source: "EXTERNAL"
source_ref: FIBO
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"

- name: Green
description: "green color"
term_source: "EXTERNAL"
source_ref: FIBO
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"

- name: Pink
description: pink color
term_source: "EXTERNAL"
source_ref: FIBO
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
terms:
- name: WindowColor
description: Supported window colors
term_source: "EXTERNAL"
source_ref: FIBO
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
values:
- Housing.Colors.Red
- Housing.Colors.Pink

- name: Kitchen
description: a room or area where food is prepared and cooked.
term_source: "EXTERNAL"
source_ref: FIBO
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"

- name: Spoon
description: an implement consisting of a small, shallow oval or round bowl on a long handle, used for eating, stirring, and serving food.
term_source: "EXTERNAL"
source_ref: FIBO
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
related_terms:
- Housing.Kitchen
knowledge_links:
- url: "https://en.wikipedia.org/wiki/Spoon"
label: Wiki link

Source file linked here.

Generating custom IDs for your terms

IDs are normally inferred from the glossary term/node's name, see the enable_auto_id config. But, if you need a stable identifier, you can generate a custom ID for your term. It should be unique across the entire Glossary.

Here's an example ID: id: "urn:li:glossaryTerm:41516e310acbfd9076fffc2c98d2d1a3"

A note of caution: once you select a custom ID, it cannot be easily changed.

Compatibility

Compatible with version 1 of business glossary format. The source will be evolved as we publish newer versions of this format.

Code Coordinates

  • Class Name: datahub.ingestion.source.metadata.business_glossary.BusinessGlossaryFileSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Business Glossary, feel free to ping us on our Slack.