A long time ago in a glaxy far far away… 🙂
… Someone asked me what I thought ‘good’ Data Governance might look like for a given data platform – warehouse or analytics solution. This included all aspects of data governance is was a very broad question.
To add some actual context, not related to Star Wars, this was during the time when Microsoft started talking about version 2 of the Azure Data Catalogue offering and what else could/should be included in a suite of governance tools. Certainly, long before the days of what we now call Azure Purview. It was also a time when GDPR had a lot of focus for the data community as we battled with the principals and technical implications.
Anyway, with a recent fresh perspective on things, I’ve decided to dust off my original notes and attempt to distill the answer of good data governanace into the following areas. What I also find with data governanace that the more I experience in the industry the deeper my perspective on the subject goes.
My 10 headings for good data governance (in no particular order) are:
- Master Data Management (MDM)
- Environment Management
- Change Management
- Owners & Stewards
Sadly, data governance isn’t really an exciting topic (in my opinion). However, it does require a lot of thinking and can be very subjective. That said, I hope that the following generic statements will later help inform an actual technical design, including a lot of innovation with the eventual Azure resources used to deliver some future solution.
You’ll also probably know if you follow me, blogging helps me process my thoughts and I like using posts as a form of public reference. As Scott Hanselman once said, why just reply with the answer to the person that asked the question, when you can write a blog about the answer that lots of people might benefit from.
Grab a cuppa… Let’s dig in.
- Environment Classification – what classifications exist for the data platform environments and what are the implications of each classification. The classifications should inform privacy, security, public and none public roles for the environment.
- Dataset Classification – at a lower level to the environment what dataset classifications exist and how are these datasets handled during processing, at both ingestion time, transformation and consumption. Classifications should inform both functional and none functional requirements for the dataset handling.
- Attribute Classification – at the lowest level, are all attributes classified in a given data platform and how granular are those classifications. Can multiple classifications apply. The more granular the classification, generally the better the handling rules should be. A simple, coarse classification of sensitive vs none sensitive for attributes is often not good enough for most data analytics and regulatory requirements. At the attribute level we should be informing what the data is for, why we are storing it and how.
- Storage Taxonomy – for the various data platform processing layers, has a concise container/folder structure been defined to support the storage of datasets as they are passed through the solution(s). If defined, does the taxonomy offer expansion as required, versioning of datasets and consider outlier use cases.
- Ingestion Process Handling – how is data handled at ingestion time given its classification and particularly if that classification is sensitive. For example, are values simply nullified if sensitive or simply removed from the dataset as an available column.
- Attribute Anonymization – does attribute anonymization exist and if so, how is it applied for different data values. For example, if a credit card number, is the value fully or only partly redacted. If a postcode, are only the first 4 characters preserved in the data storage.
- Sensitive Data Store – for sensitive data, where there is a requirement for it to be stored as part of a data platform solution, how is it stored. Does this exist in a dedicated privacy locker, isolated from the main (none sensitive) datasets. If so, what practices exist for the input and output of data in that privacy locker and how is the data connected. Can sensitive data become orphaned.
- Privacy Policies – for personal data captured about customers, what privacy policies exist to inform the user how their data is handled, used and even shared.
- Regulatory Procedures – for regulations like GDPR, where users can invoke rights to be forgotten and rights to receive all data that has been captured about them. What procedures exist to support these practices, as both technical and none technical workflows.
- Row Level Auditing – during every stage of data processing are records appended with a standard set of auding attributes. If so, what are they. Examples of these attributes could be; source system, updated date/time, update by, country of origin. Then if a record is manually updated, is this captured directly in the dataset with a user ID.
- Process Execution Flow – across different technologies in a given set of processing pipelines can data lineage be tracked for a dataset or attribute. Can this be represented as a graphical execution chain that can be explored and shared with other supplementary metadata.
- Attribute Tracking – in situations invovling schema drift how is the versioning of attributes handled. Are all changes persisted within base tables with a layer of views that abstract common (external) attribute naming conventions. This could also be referred to a row level data lineage.
- Consumer Portal – for a given solution, is all data cataloged and made available as a data dictionary of items and attributes. Here attributes should offer metadata information containing example contents, data type, classification, usage.
- Profiling – what data profiling practices exist to offer supplementary metadata on a given source system or dataset. Is this profiling information stored, refreshed regularly and made available as part of the wider data dictionary.
- Entity Register – does a data entity register (DER) exist offering a business set of domains areas, definitions and standards for all datasets handled. This should be source system agnostic and be backed my clear lines of authoritive ownership.
- Discovery – how is new data onboarded into the wider data cataloguing initiative. If done by automated agents, scrapping known data storage repositories, is this output cleansed before being made available for consumption. In none automated cases, can users self-catalogue data as found (discovered) in disparate sources.
- Model Diagrams – for complex data models does a entity relationship diagram exist to inform both technical and none technical audiences. For example, in a Kimble data warehouse, which dimenions have relationships to a given fact table.
- Conformity – what rules exist for datasets and when are they applied during the curation of data into a given solution. For example, source systems might record monetary values in localized currencies. If so, are these values conformed into a single currency during data curation meaning values are then comparable across sources and business regions.
- Cleansing – what data cleansing rules exist and how are they applied. For example, raw data is ingested, then for all attributes with a data type of string, trimming functions are applied to remove white space at the beginning and the end of the values. For a given data cleaning process, how dynamically can rules be create/updated.
- Rule Creation – does a processes or software package exist to capture data quality rules. Does this include a basic set of rules as well as the ability to implement custom/domain applicable rules. For example, phone numbers can be assumed as being a fixed length of characters in a given region. This rule is basic and could be extended to also include conditions related to country codes that are added to a phone number as well as context conditions for different countries.
- Data Quality Assessments – for data quality scores produced from a set of rules being applied, how are those scores assessed, targeted and improved. Reporting poor data quality is meaningless if no follow up actions and workflows exist to correct the issue.
- Record Verification – once data has been captured and persisted to storage from internal data sources, do additional processes exist to verify the values for accuracy. If so, what are they and how are conflicts handled. For example, a customer declares they have no penalties on their driving license via unvalidated data entry processes. Does a workflow exist to confirm this information is correct with the respective driving authority.
Master Data Management (MDM)
- Reference Data Consolidation – in a given environment, business function or wider solution is reference data consolidated into a single store. This could be described as a reference data warehouse. The purpose of such a store is to allow reference data refinement and act as an authority for serving other master data workflows.
- Reference Data Translation – when multiple data sources offer a version of the same reference data. What process exists for those records to be consolidated and mapped into a single version for a given record set. For example, gender, system 1 reports ‘male’, system 2 reports ‘M’. The values mean the same thing and one can be translated into the other. What process supports this and how are internal/external values consolidated.
- Consumable API’s – for master data to remain current it has to be easily accessible to the wider community of data consumers/data products. If a solution can’t easily access master data information, progress will mean they ultimately make their own version of it, resulting in the failure of the master data concept.
- Stale Data Recycling – often for a given solution data is loaded into development and test environments. This can come from equivalent development and test source systems, or from a production data source where environments aren’t aligned in the wider end to end process. In either case, its common for development and test data become stale and disconnected, espically if loaded from an upstream production source. This can result it poor data quality and incorrect values for personal data that frequently change. A process should therefore exist to clear down none production environments to ensure data doesn’t stagnate.
- Development Practices – for certain development processes production data is required to ensure model accuracy and identify data outliers. Where this requirement exists, internal processes should ensure the production data in development environments is restricted and removed once the given development work stream is complete. Similar to the above point about stale data, but aligned to development cycles. Develop > Test > Deploy > Delete Development Data.
- Model Management – in machine learning and other AI practices, data science models can be trained and used for record scoring or other predictive workloads. Over time those models degrade as the data they were trained with becomes out dated. A process of model re-training should therefore exist to ensure data accuracy remains high and model outputs stay current. For example, a model that predicts the fuel efficiency of cars, but was trained using vehicle data from 1990. Now when new car information is entered its highly likely that the fuel efficiency prediction will be incorrect.
- Data Life Cycle and Archiving – in various database technologies and storage layers what period/duration is data kept for and how is old data removed completely or moved to cold storage accounts. How is ‘old’ defined given the natural of the dataset and if archived data needs to be re-hydrated how seamlessly does the platform support this operation. Hot data vs cold data vs re-hydrated data should have clear workflows and business rules applied.
- Reload and Reconcile – Where required, technical design and solution architecture should support the requirements to reload datasets and reconcile processing data with functionality baked into the platform.
- Auditing – where required for governing bodies and to meet legal regulations is data auditing done and if so how. This could include mandated audit requests as well as platform baked in auditing that offering alerting for data changes made outside of expected channels.
- Data Source Upgrades – when, not if, a data source updates its system, what are the implications of this for solutions downstream of the data source. Are change management assessments done to ensure updates to data schemas are captured and handled, ultimately avoiding data processing failures. Or, does the source system offer an abstracted API layer for data access meaning a consistent view of datasets is always presented meaning upstream changes are disconnected/handled internally by the source system.
- Schema Drift – for changes in dataset schema, are these changes permitted as part of the ingestion data processing and beyond. If so, how are the ultimate schema updates captured in the output datasets. In all cases, is it only schema additions that can be softly handled and schema item removals or changes will always result in processing failure.
- Attributes Renaming – what levels of abstract exist in a given data processing chain to avoid processing failure for attribute headers that get renamed. In a fixed column or delimited dataset does the header label matter during processing or only the index of the column. If done based on simple column order how does this work alongside schema drift processes.
- Data Types Changes – given data type precedence rules and implicit conversion what data type changes are automatically handled during data processing compared to those that require intervention (from a data engineer). For example, reducing the precision on a decimal attribute from 10 decimal places, to 3 decimal places might not result in a processing failure, but could have serious implications for values that become rounded off, especially when aggregated across many millions of rows.
- Encryption – for all data processing routines is the encrypted applied end to end, from source, during processing, at rest, in transit and when consumed by a client. It is required, and if so how are encryption keys handled.
- Permissions Models – in a given environment security model is this orientated around data classification or does a simple hierarchy exist where senior roles get access to sensitive data by default. For example, even the CEO should have to follow a process of justification before being granted access to sensitive data. It shouldn’t be given by default to any role. Data access needs to be justified as needed for a role.
- Key Recycling and Handling – supporting data encryption and other automated processes often requires service accounts with dedicated keys (secrets). Practices for storing these keys centrally should exist as well as wider operational environment processes that result is keys periodically expiring and being recycled. For example, user passwords expire and have to be changed, the same concepts should apply to service accounts and automated process authentication.
- Row Level Security – for business critical datasets semantic layer perspectives and row level security permissions should exist meaning only partial record sets can be viewed by a given user. For example, a national retailer, has a complete dataset of all sales. However, an area manager is only able to view records applicable to their territory.
- Hashing Algorithms – where sensitive data needs to be aggregated, but without the actual sensitive data values being exposed, what processes for hashing, re-hashing and un-hashing the values exist, with a clearly defined standard for masking and documenting record salts.
Owners & Stewards
- Accountable Domains – for a given business area are there clear channels of authority regarding ownership of datasets and what is the level of local/national/regional, or regulatory accountability. For example, in the event of a data breach, who is responsible for the legal implications.
- Access Approval – for solutions and datasets how are access requests handled for new and existing users and do workflows exist to ensure access requests are justified for a given data consumer requirements. This should include automated processes that move data outside of the data owners normal remittance.
- Consumer Persona’s – data consumption personas allow for an informed security model and associated technical requirements in a given solution for anyone wanting access to data. Having a clearly defined set of persona’s improves business handling and offers an element of auditing when investigating issues, assuming read vs write access is limited to only the essentials for a persona.
- Stewards – for certain business functions and processes are data stewards appointed as a means of ownership delegation for request handling and to improve overall data management. A steward may also handle reference data mappings. If available, what roles do these stewards perform when, for example, improving data quality. Or handling master data translation.
As with all my blogs posts when I’m simply sharing my thoughts on a topic or technology I’d welcome your feedback.
Do you agree/disagree? Have I missed anything? What next?
Many thanks for reading