The quality of the data produced by a discovery process is critical for everything done with that data: bad data leads to bad decisions and undermines acceptance of the systems that consume it. The articles and videos on CMDB integration and on security processes and operations stressed this point repeatedly.
But what is data quality really? Which qualities matter, and what do they mean in practice? In this article and video we take a closer look at the need for data quality, its dimensions, and concrete examples of what it means in practice.
Importance of Data Quality
The users of a discovery system are other tools: foremost a CMDB, but also IT and security monitoring tools and more. These clients rely on the discovery having done a good job of identifying, classifying and relating resources in its object model. They cannot easily fix defects in the discovery data, and if they can, only with high and often manual effort. Clearly this needs to be avoided. Therefore, beyond the functionality, the quality of the data provided by a discovery tool needs major attention.
Software development often focuses on the functionality of a component, but its qualities are equally important. Quality can be defined as:
Quality is the standard of something as measured against other things of a similar kind; the degree of excellence of something.
Qualities can be grouped into a quality tree; for software quality, for example, ISO 25010 defines one. Such a quality tree helps to organize the different kinds of qualities into categories.
In this article we look at the quality of data rather than software, focusing only on the principal categories and on the purpose of network discovery.
The following six qualities are essential for discovery: timeliness, accuracy, completeness, consistency, validity and uniqueness.
We will have a look at each quality in the following sections.
Timeliness

The timeliness quality answers the question “Is the data available when you need it?”. For discovery this means that the information in the data model has to be always up to date and accessible. Being up to date means that discovery needs to scan the assets in the managed environment constantly, or at least as often as possible and feasible.
For this to work, discovery needs to be scheduled frequently and run fast while minimizing the impact on the discovered environment itself. An efficient and intelligent discovery process is required to satisfy this quality.
A high level of parallelism can discover more devices and details per hour, but it should not exhaust the discovery server, so it needs to be tunable. Different network zones require different levels of up-to-dateness, so some flexibility in using different discovery jobs for them is critical. The schedule should be configurable to the needs of the business and should also allow blackout times, e.g. when a backup already saturates the network.
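As a minimal sketch of such a schedule, the following hypothetical configuration assigns each zone its own discovery interval and blackout windows (all names and values are illustrative, not a real product API):

```python
from datetime import datetime, time

# Hypothetical per-zone schedule: a rediscovery interval in hours plus
# blackout windows (start, end) during which no discovery job may run.
ZONE_SCHEDULES = {
    "dmz": {"interval_hours": 4, "blackouts": [(time(1, 0), time(3, 0))]},
    "intranet": {"interval_hours": 24, "blackouts": [(time(22, 0), time(23, 59))]},
}

def in_blackout(zone: str, now: datetime) -> bool:
    """True if 'now' falls inside any blackout window of the zone."""
    return any(start <= now.time() <= end
               for start, end in ZONE_SCHEDULES[zone]["blackouts"])

def is_due(zone: str, last_run: datetime, now: datetime) -> bool:
    """A zone is due when its interval has elapsed and no blackout applies."""
    elapsed_hours = (now - last_run).total_seconds() / 3600
    return (elapsed_hours >= ZONE_SCHEDULES[zone]["interval_hours"]
            and not in_blackout(zone, now))
```

A scheduler loop would call is_due per zone and launch the corresponding job, keeping the DMZ fresher than the intranet while respecting the backup window.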
Accuracy

Accuracy answers the question “How well does a piece of data reflect reality?”. In the case of discovery: is the data in the discovery model consistent with the state and details of the real assets? Discovery should not produce invalid or wrong information, as many other systems, such as CMDBs, depend on it. Ensuring high accuracy at the source, and discovery is that source, is the best way to achieve high quality in the end.
A discovery tool should be able to self-assess its data quality and give the user recommendations, with explanations, so that the user can help improve the information. This could mean, for example, adding more credentials for certain protocols or systems, or modifying a configuration setting of a device.
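One way to sketch such self-assessment: score each device record by how many expected fields were actually filled, and attach a remediation hint for each gap. The fields and hints below are assumptions for illustration, not a specific tool's checks:

```python
# Hypothetical self-assessment: score how completely a device record was
# discovered and suggest a remediation for each gap found.
def assess_device(record: dict) -> tuple[float, list[str]]:
    checks = {
        "serial_number": "Enable SNMP or WMI credentials to read the serial number.",
        "os_version": "Add login credentials so the OS version can be queried.",
        "installed_software": "Grant an account with software-inventory permissions.",
    }
    hints = [hint for field, hint in checks.items() if not record.get(field)]
    score = 1.0 - len(hints) / len(checks)
    return score, hints
```

A record with all three fields scores 1.0 with no hints; each missing field lowers the score and yields a concrete suggestion the user can act on.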
Completeness

Completeness relates to the question “Does it fulfill your expectations of what is comprehensive?”. Transparency is critical in two respects: the completeness of the list of discovered systems on the one hand, and the amount of detail discovered for each device on the other.
Otherwise, as we have seen in https://blog.jdisc.com/2020/12/04/discovery-for-operational-security-audits/, a single missing asset can have a significant impact. A discovery tool should therefore provide usable diagnostic and troubleshooting tools to find unidentified devices, issues with discovery protocols or device access, parsing errors, or even duplicate device names.
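Two of these diagnostics can be sketched in a few lines, assuming we have a set of addresses that answered a network sweep and a mapping of identified addresses to device names (both hypothetical inputs):

```python
from collections import Counter

def find_unidentified(responding_ips: set[str], identified: dict[str, str]) -> set[str]:
    """Addresses that answered the sweep but could not be classified."""
    return responding_ips - identified.keys()

def find_duplicate_names(identified: dict[str, str]) -> set[str]:
    """Device names claimed by more than one address hint at duplicates."""
    counts = Counter(identified.values())
    return {name for name, n in counts.items() if n > 1}
```

Surfacing both lists to the user makes the gap between “what answered” and “what was identified” visible, which is exactly the transparency completeness demands.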
Consistency

The consistency dimension answers the question “Does data stored in one place match relevant data stored elsewhere?”. A discovery service is the basis for a CMDB, and the information in both should be consistent, as should information discovered by two discovery servers (e.g. separated by a firewall between a DMZ and the enterprise intranet).
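A simple consistency check can be sketched as a diff of the records two servers hold for the same device, keyed for example by serial number (the record fields are illustrative):

```python
# Hypothetical consistency check: compare the attributes that two
# discovery servers report for the same device.
def record_diff(record_a: dict, record_b: dict) -> dict:
    """Return attributes whose values disagree, as (value_a, value_b) pairs."""
    return {key: (record_a.get(key), record_b.get(key))
            for key in record_a.keys() | record_b.keys()
            if record_a.get(key) != record_b.get(key)}
```

An empty diff means both servers agree; anything else is a concrete inconsistency to investigate before the data reaches the CMDB.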
Validity

Validity answers the question “Is the information in a specific format, does it follow business rules, or is it in an unusable format?”. For the discovery tool this is the question of how it stores its information and how users and external systems can access the data.
The object model of a discovery tool should provide a consistent representation of its information, so that other solutions can rely on its quality.
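Enforcing such a representation can be as simple as validating each field against a known format before a record enters the object model. A minimal sketch with two assumed rules (MAC address and dotted IPv4 notation):

```python
import re

# Hypothetical validity rules: each field must match a canonical format
# before the record is accepted into the object model.
FIELD_RULES = {
    "mac": re.compile(r"([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}"),
    "ipv4": re.compile(r"(\d{1,3}\.){3}\d{1,3}"),
}

def invalid_fields(record: dict) -> list[str]:
    """Return the names of fields that violate their format rule."""
    return [field for field, rule in FIELD_RULES.items()
            if field in record and not rule.fullmatch(record[field])]
```

Rejecting or flagging records with invalid fields at the source keeps downstream consumers from having to guess what a malformed value means.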
Uniqueness

The uniqueness dimension answers the question “Is this the only instance in which this information appears in the database?”. This might sound like a strange question, but it is very important and relates to the process of normalization.
In the real world, the same asset is captured by multiple systems, and each system may identify the asset and its properties with a slightly different identifier or name. For a human being this is usually not a problem, as our brain automatically recognizes that both records refer to the same asset. A computer cannot do this as easily as we do.
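Normalization gives the computer that ability: identifiers captured in different notations are reduced to one canonical form so the same asset matches itself. A sketch for two common cases, hostnames and MAC addresses (the canonical forms chosen here are assumptions, not a standard):

```python
# Hypothetical normalization: reduce identifiers captured by different
# systems to one canonical form so duplicate records can be matched.
def normalize_hostname(name: str) -> str:
    """Lower-case and strip the domain: 'Srv01.corp.example.com' -> 'srv01'."""
    return name.strip().lower().split(".")[0]

def normalize_mac(mac: str) -> str:
    """Canonical colon-separated lower-case form for common MAC notations."""
    digits = mac.lower().replace(":", "").replace("-", "").replace(".", "")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))
```

After normalization, "SRV01" and "Srv01.corp.example.com", or "00-1A-2B-3C-4D-5E" and "001a.2b3c.4d5e", compare equal, so the duplicate records can be merged into a single unique asset.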