Event Log Imperfection Patterns
Downloads of the original event log imperfection patterns paper:
S. Suriadi, R. Andrews, M. Wynn and A.H.M. ter Hofstede.
Event Log Imperfection Patterns: Towards a Systematic Approach to Cleaning Event Logs. (PDF, 1.29Mb)
Information Systems, 64(1), pages 132-150, Elsevier, 2017.
Introduction
The quality of the data presented to process modelling algorithms is critical to the success of any process mining exercise. Such modelling tools are generally indifferent to data quality, i.e. provided the input can be parsed by the tool, its underlying algorithm(s) will generate output. Thus the veracity, realism and precision of the analysis results will be affected by input data quality.
Pre-processing (cleaning) event logs to address quality issues prior to conducting a process mining analysis is a necessary, but generally tedious and ad hoc task. Clearly, a systematic approach to identifying and remedying (commonly occurring) event log data quality issues is desirable.
It turns out that there are many data quality issues commonly found in process mining event logs or encountered while preparing event logs from raw data sources that can be described using patterns. Thus, systematically checking for the signatures of these patterns in an event log can reveal the existence of the associated quality issues and inform appropriate remedial action(s).
In these pages, we present a set of event log imperfection patterns, distilled from our experiences in conducting process mining analyses. The patterns relate to data quality issues that have been observed in multiple event logs (or raw data sources) by multiple practitioners/researchers. We do not claim that this collection of patterns is complete (i.e. captures every possible problem that may afflict an event log). We maintain, however, that the patterns provide a way to check for some commonly occurring and, from a process mining perspective, high priority issues. Addressing these issues systematically and routinely provides a ground level of quality assurance and allows researchers and practitioners to devote effort to uncovering domain and log specific quality issues.
Background
Event logs used in process mining have a structure designed to allow representation of key attributes of events that occurred over multiple executions of a given process. These event logs may then be presented to a process mining algorithm with the aim of analysing and mapping the process (discovery), determining the extent to which actual execution of the process agrees with the intended execution (conformance) or quantifying various aspects of process behaviour (performance).
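As a rough illustration of this structure, the sketch below models an event with a case identifier, activity name, timestamp and resource, and groups events into per-case traces ordered by time. The class name `Event`, the helper `group_into_traces` and the sample records are assumptions made for illustration only; they are not taken from the paper or from any particular log format.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One recorded event (hypothetical minimal schema)."""
    case_id: str                    # process execution the event belongs to
    activity: str                   # activity name (label)
    timestamp: datetime             # when the event was recorded
    resource: Optional[str] = None  # who or what performed the activity


def group_into_traces(events):
    """Group events by case and order each case's events by timestamp."""
    traces = defaultdict(list)
    for event in events:
        traces[event.case_id].append(event)
    return {
        case_id: sorted(case_events, key=lambda e: e.timestamp)
        for case_id, case_events in traces.items()
    }


# Made-up sample records, deliberately not in time order within case "c1".
log = [
    Event("c1", "Review", datetime(2017, 1, 1, 10, 0), "carol"),
    Event("c2", "Register", datetime(2017, 1, 1, 9, 5), "bob"),
    Event("c1", "Register", datetime(2017, 1, 1, 9, 0), "alice"),
]
traces = group_into_traces(log)
print([e.activity for e in traces["c1"]])  # ['Register', 'Review']
```

Because the ordering of events within a case is derived from timestamps here, any timestamp quality problem directly changes the trace that a mining algorithm sees.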
Irrespective of the specific type of analysis, its success depends on the quality of the data used. In these pages we discuss patterns as a mechanism for describing commonly encountered problems and solutions, and introduce our pattern language as a structured method for describing our event log imperfection patterns. We also describe the characteristics of an event log, including its key components and some definitions we will use later when defining our patterns. Finally, we provide links to existing works that discuss general notions of data quality and, recognising that an event log, as a data structure, has some unique features (relating to temporal ordering of records), we describe a data quality framework (J.C. Bose et al., 2013) suitable for event logs that allows us to easily categorise the event log data quality issues to which each of our patterns relates.
Patterns
In A Pattern Language: Towns, Buildings, Construction [AIS77], Christopher Alexander provides the rationale for using patterns as a means of describing problems (and their solutions) commonly faced in our built environment.
"Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over without ever doing it the same way twice".
Subsequently, patterns as a means of understanding and communicating the characteristics of an apparently chaotic domain have been adopted in other disciplines including business analysis, object-oriented programming, enterprise architecture, workflow functionality and system security.
In these pages we illustrate the use of patterns as a novel way of considering data quality issues that may be encountered in process mining event logs. We discuss the notion at a conceptual level, seeking to strike a balance between generality and precision, with the aim of making it accessible to a broad cross-section of the process mining community. We adopt a textual/graphical representation as a convenient, language and implementation independent representation that promotes awareness and makes the patterns easily understandable without overwhelming the reader with technical considerations. Thus, we describe each pattern using a pattern language made up of the following components:
- Description: outline of the pattern and how and where the pattern may be introduced into a log
- Affect: consequence of the existence of the pattern on the outcomes of a process mining analysis
- Data Quality Issues: type of data error and the event log entities affected by the pattern
- Manifestation and Detection: strategy to detect the presence of the pattern in a log
- Remedy: how the pattern may be removed from a log
- Side-effects of Remedy: possible, undesirable consequences of application of the remedy
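The components listed above can be read as a fixed template that every pattern description fills in. A minimal sketch of such a template follows; the class `ImperfectionPattern`, the helper `missing_components` and the placeholder values are hypothetical and serve only to show the shape of a pattern record, not the content of any actual pattern.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ImperfectionPattern:
    """One pattern, with a field per component of the pattern language."""
    name: str
    description: str
    affect: str
    data_quality_issues: List[str] = field(default_factory=list)
    manifestation_and_detection: str = ""
    remedy: str = ""
    side_effects_of_remedy: str = ""


def missing_components(pattern):
    """Return the names of components that have not been filled in yet."""
    return [f.name for f in fields(pattern) if not getattr(pattern, f.name)]


# Placeholder values only; a real entry would carry the documented text.
draft = ImperfectionPattern(
    name="Example Pattern",
    description="<how and where the pattern is introduced>",
    affect="<consequence for the analysis>",
    data_quality_issues=["I16"],
    manifestation_and_detection="<detection strategy>",
)
print(missing_components(draft))  # ['remedy', 'side_effects_of_remedy']
```

Keeping the components as named fields makes it easy to check that a pattern description is complete before it is shared.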
Data Quality Framework
As with other consumed products, the quality of data (used as input for an analysis) can be assessed in terms of its fitness for purpose. Data quality is frequently discussed in the literature as a multi-dimensional concept, meaning that multiple facets of a given data object (where data objects include, but are not limited to, item, record, database, etc.) need to be considered to determine the quality. Frequently mentioned quality dimensions include accuracy, relevance, completeness, consistency, reliability, etc. The following are useful links when considering various notions of data quality:
- ISO/IEC 25012 standard aims to define a "general data quality model for data retained in a structured format within a computer system".
- Batini and Scannapieco (2006) provide a comprehensive general data quality framework.
- Wang and Strong (1996) discuss the importance of data quality to consumers and provide a framework, the dimensions of which are still frequently used today. Importantly, the authors validate their dimensions empirically with data consumer input (i.e. using surveys).
Event logs, as data objects peculiar to process mining analyses, have some unique features that ‘standard’ data quality frameworks do not adequately address. There are only a few quality frameworks specifically targeted at event logs.
- Process Mining Manifesto (van der Aalst et al., 2011) defines a star-rating (1 to 5) that indicates the maturity of the log (readiness for process mining).
- Bose et al., 2013, categorise data quality issues according to the event log entity affected and whether the data is Missing, Incorrect, Imprecise or Irrelevant (see Table 1).
- Mans et al., 2012, describe log quality as a two-dimensional spectrum with the first dimension concerned with the level of abstraction of the events and the second dimension relating to the accuracy of the timestamp (as measured by granularity, directness of recording and correctness).
We currently adopt the data quality framework proposed by J.C. Bose, R. Mans and W.M.P. van der Aalst as a means of categorising data quality issues that affect event logs.
| Event Log Quality Issues / Event Log Entities | case | event | relationship | case attrs. | position | activity name | timestamp | resource | event attrs. |
|---|---|---|---|---|---|---|---|---|---|
| Missing data | I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 |
| Incorrect data | I10 | I11 | I12 | I13 | I14 | I15 | I16 | I17 | I18 |
| Imprecise data | | | I19 | I20 | I21 | I22 | I23 | I24 | I25 |
| Irrelevant data | I26 | I27 | | | | | | | |
Table 1: Manifestation of quality issues in event log attributes
A description/example of each of these data quality issues may be found here.
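The numbering in Table 1 is regular enough to rebuild programmatically, which can be handy when tagging findings with issue codes. The sketch below derives each code from its row and column in the table; the names `ENTITIES`, `ROWS`, `ISSUES` and `describe` are illustrative assumptions and are not part of the framework itself.

```python
# Column order of Table 1 (event log entities).
ENTITIES = [
    "case", "event", "relationship", "case attrs.", "position",
    "activity name", "timestamp", "resource", "event attrs.",
]

# For each row of Table 1: the first issue number and the entities it covers.
ROWS = {
    "Missing data": (1, ENTITIES),
    "Incorrect data": (10, ENTITIES),
    "Imprecise data": (19, ENTITIES[2:]),    # no entries for case or event
    "Irrelevant data": (26, ENTITIES[:2]),   # entries for case and event only
}

# Map every issue code to its (category, entity) pair.
ISSUES = {
    f"I{start + offset}": (category, entity)
    for category, (start, entities) in ROWS.items()
    for offset, entity in enumerate(entities)
}


def describe(code):
    """Return a short readable label for an issue code from Table 1."""
    category, entity = ISSUES[code]
    return f"{code}: {category} ({entity})"


print(describe("I23"))  # I23: Imprecise data (timestamp)
```

A lookup of this kind lets a cleaning script report, for example, that a detected problem falls under I16 (incorrect timestamp) without hard-coding the table in several places.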
Links to Individual Imperfection Patterns
1. Form-based Event Capture
2. Inadvertent Time Travel
3. Unanchored Event
4. Scattered Event
5. Elusive Case
6. Scattered Case
7. Collateral Events
8. Polluted Label
9. Distorted Label
10. Synonymous Labels
11. Homonymous Label