Event Log Imperfection Patterns

Downloads of the original event log imperfection patterns paper:

S. Suriadi, R. Andrews, M. Wynn and A.H.M. ter Hofstede. Event Log Imperfection Patterns: Towards a Systematic Approach to Cleaning Event Logs (PDF, 1.29Mb). Information Systems, 64(1), pages 132-150, Elsevier, 2017.

Introduction

The quality of the data presented to process mining algorithms is critical to the success of any process mining exercise. Such mining tools are generally data quality indifferent, i.e. provided the input can be parsed by the tool, its underlying algorithm(s) will generate output. Thus the veracity, realism and precision of the analysis results are directly affected by the quality of the input data.

Pre-processing (cleaning) event logs to address quality issues prior to conducting a process mining analysis is a necessary, but generally tedious and ad hoc task. Clearly, a systematic approach to identifying and remedying (commonly occurring) event log data quality issues is desirable.

It turns out that there are many data quality issues commonly found in process mining event logs or encountered while preparing event logs from raw data sources that can be described using patterns. Thus, systematically checking for the signatures of these patterns in an event log can reveal the existence of the associated quality issues and inform appropriate remedial action(s).
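Checking for such a signature can often be automated. As a minimal sketch (the tuple-based log representation and this particular signature are illustrative assumptions, not the paper's implementation), the following Python flags clusters of events within a case that share an identical timestamp, one possible symptom of form-based event capture:

```python
from collections import defaultdict

def same_timestamp_groups(events):
    """Group events of each case that share an identical timestamp.

    `events` is an iterable of (case_id, activity, timestamp) tuples;
    clusters of same-timestamp events within a case are one possible
    signature of form-based event capture.
    """
    groups = defaultdict(list)
    for case_id, activity, ts in events:
        groups[(case_id, ts)].append(activity)
    # keep only the (case, timestamp) pairs with more than one event
    return {k: v for k, v in groups.items() if len(v) > 1}

log = [
    ("c1", "Register", "2017-03-01 09:00:00"),
    ("c1", "Triage",   "2017-03-01 09:00:00"),  # same timestamp as Register
    ("c1", "Treat",    "2017-03-01 10:15:00"),
    ("c2", "Register", "2017-03-02 08:30:00"),
]
suspect = same_timestamp_groups(log)
```

Each flagged group is only a candidate for the quality issue; as with all the patterns, confirming it (and choosing a remedy) requires domain knowledge of how the log was recorded.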

In these pages, we present a set of event log imperfection patterns, distilled from our experiences in conducting process mining analyses. The patterns relate to data quality issues that have been observed in multiple event logs (or raw data sources) by multiple practitioners and researchers. We do not claim that this collection of patterns is complete (i.e. that it captures every possible problem that may afflict an event log). We maintain, however, that the patterns provide a way to check for some commonly occurring and, from a process mining perspective, high-priority issues. When these are systematically and routinely addressed, they provide a ground-level quality assurance and allow researchers and practitioners to devote their effort to uncovering domain- and log-specific quality issues.

Background

Event logs used in process mining have a structure designed to allow representation of key attributes of events that occurred over multiple executions of a given process. These event logs may then be presented to a process mining algorithm with the aim of analysing and mapping the process (discovery), determining the extent to which actual execution of the process agrees with the intended execution (conformance) or quantifying various aspects of process behaviour (performance).
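Concretely, each event minimally records a case identifier, an activity label and a timestamp (often also a resource); grouping events by case and ordering them by time yields the traces that mining algorithms consume. A minimal sketch of this structure (field names are illustrative assumptions, not a standard):

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Event:
    case_id: str        # which process instance the event belongs to
    activity: str       # activity label
    timestamp: str      # ISO-formatted time of occurrence
    resource: str = ""  # optional: who or what performed the activity

def to_traces(events):
    """Group events by case and order each trace by timestamp."""
    traces = defaultdict(list)
    for e in events:
        traces[e.case_id].append(e)
    for case_id in traces:
        traces[case_id].sort(key=lambda e: e.timestamp)
    return dict(traces)

log = [
    Event("c1", "Register", "2017-03-01T09:00"),
    Event("c1", "Treat",    "2017-03-01T10:15"),
    Event("c2", "Register", "2017-03-02T08:30"),
]
traces = to_traces(log)
```

A quality issue in any of these fields (a missing case identifier, an imprecise timestamp, an ambiguous activity label) distorts the traces and hence every downstream discovery, conformance or performance result.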

Irrespective of the specific type of analysis, its success depends on the quality of the data used. In these pages we discuss patterns as a mechanism for describing commonly encountered problems and their solutions, and introduce our pattern language as a structured method for describing our event log imperfection patterns. We also describe the characteristics of an event log, including its key components and some definitions we will use later when defining our patterns. Finally, we provide links to existing works that discuss general notions of data quality and, recognising that an event log, as a data structure, has some unique features (relating to the temporal ordering of records), we describe a data quality framework (Bose et al., 2013) suitable for event logs that allows us to categorise the data quality issues to which each of our patterns relates.

Patterns

In A Pattern Language: Towns, Buildings, Construction [AIS77], Christopher Alexander provides the rationale for using patterns as a means of describing problems (and their solutions) commonly faced in our built environment.

"Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over without ever doing it the same way twice".

Subsequently, patterns as a means of understanding and communicating the characteristics of an apparently chaotic domain have been adopted in other disciplines including business analysis, object-oriented programming, enterprise architecture, workflow functionality and system security.

In these pages we illustrate the use of patterns as a novel way of considering data quality issues that may be encountered in process mining event logs. We discuss the notion at a conceptual level, seeking to strike a balance between generality and precision, with the aim of making it accessible to a broad cross-section of the process mining community. We adopt a textual/graphical representation as a convenient, language- and implementation-independent form that promotes awareness and makes the patterns easily understandable without overwhelming the reader with technical considerations. Thus, we describe each pattern using a pattern language comprising the following components:

  • Description: outline of the pattern and how and where the pattern may be introduced into a log
  • Effect: consequence of the existence of the pattern on the outcomes of a process mining analysis
  • Data Quality Issues: type of data error and the event log entities affected by the pattern
  • Manifestation and Detection: strategy to detect the presence of the pattern in a log
  • Remedy: how the pattern may be removed from a log
  • Side-effects of Remedy: possible, undesirable consequences of application of the remedy

Data Quality Framework

As with other consumed products, the quality of data (used as input for an analysis) can be assessed in terms of its fitness for purpose. Data quality is frequently discussed in the literature as a multi-dimensional concept, meaning that multiple facets of a given data object (where data objects include, but are not limited to, items, records, databases, etc.) need to be considered to determine its quality. Frequently mentioned quality dimensions include accuracy, relevance, completeness, consistency and reliability. The following are useful links when considering various notions of data quality:

  • ISO/IEC 25012 standard aims to define a "general data quality model for data retained in a structured format within a computer system".
  • Batini and Scannapieco (2006) provide a comprehensive general data quality framework.
  • Wang and Strong (1996) discuss the importance of data quality to consumers and provide a framework, the dimensions of which are still frequently used today. Importantly, the authors validate their dimensions empirically with data consumer input (i.e. using surveys).

Event logs, as data objects peculiar to process mining analyses, have some unique features that 'standard' data quality frameworks do not adequately address, making such frameworks unsuitable for assessing event log quality. Only a few quality frameworks specifically target event logs.

  • The Process Mining Manifesto (van der Aalst et al., 2011) defines a star rating (1 to 5) that reflects the maturity of a log (its readiness for process mining).
  • Bose et al. (2013) categorise data quality issues in an event log according to the entity or attribute affected and whether the data is Missing, Incorrect, Imprecise or Irrelevant (see Table 1).
  • Mans et al. (2012) describe log quality as a two-dimensional spectrum, with the first dimension concerned with the level of abstraction of the events and the second relating to the accuracy of the timestamp (as measured by granularity, directness of recording and correctness).

We currently adopt the data quality framework proposed by J.C. Bose, R. Mans and W.M.P. van der Aalst as a means of categorising data quality issues that affect event logs.

 

 

                                                Event Log Entities
Quality Issues   | case | event | relationship | case attrs. | position | activity name | timestamp | resource | event attrs.
-----------------|------|-------|--------------|-------------|----------|---------------|-----------|----------|-------------
Missing data     | I1   | I2    | I3           | I4          | I5       | I6            | I7        | I8       | I9
Incorrect data   | I10  | I11   | I12          | I13         | I14      | I15           | I16       | I17      | I18
Imprecise data   |      |       | I19          | I20         | I21      | I22           | I23       | I24      | I25
Irrelevant data  | I26  | I27   |              |             |          |               |           |          |

Table 1 - Manifestation of quality issues in event log attributes
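Table 1 can be encoded directly as a lookup from issue category and event log entity to issue identifier, which makes the categorisation easy to apply programmatically. A minimal Python sketch (the nested-dict encoding and function name are our own; the identifiers come from Table 1):

```python
# Table 1 as a nested mapping: issue category -> entity -> issue id.
TABLE_1 = {
    "Missing data":    {"case": "I1", "event": "I2", "relationship": "I3",
                        "case attrs.": "I4", "position": "I5",
                        "activity name": "I6", "timestamp": "I7",
                        "resource": "I8", "event attrs.": "I9"},
    "Incorrect data":  {"case": "I10", "event": "I11", "relationship": "I12",
                        "case attrs.": "I13", "position": "I14",
                        "activity name": "I15", "timestamp": "I16",
                        "resource": "I17", "event attrs.": "I18"},
    "Imprecise data":  {"relationship": "I19", "case attrs.": "I20",
                        "position": "I21", "activity name": "I22",
                        "timestamp": "I23", "resource": "I24",
                        "event attrs.": "I25"},
    "Irrelevant data": {"case": "I26", "event": "I27"},
}

def issue_id(category, entity):
    """Return the Table 1 identifier, or None where no issue is defined."""
    return TABLE_1.get(category, {}).get(entity)
```

The blank cells of Table 1 (e.g. imprecise case data) simply have no entry and yield None.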

A description/example of each of these data quality issues may be found here.

Links to Individual Imperfection Patterns

1. Form-based Event Capture
2. Inadvertent Time Travel
3. Unanchored Event
4. Scattered Event
5. Elusive Case
6. Scattered Case
7. Collateral Events
8. Polluted Label
9. Distorted Label
10. Synonymous Labels
11. Homonymous Label