Scientific workflow systems

Jacques Wainer
Department of Computer Science, State University of Campinas, Brazil
wainer@dcc.unicamp.br
Mathias Weske
Wirtschaftsinformatik, University of Muenster, Germany
weske@uni-muenster.de
Gottfried Vossen
Wirtschaftsinformatik, University of Muenster, Germany
vossen@uni-muenster.de
Claudia M Bauzer Medeiros
Department of Computer Science, State University of Campinas, Brazil
cmbm@dcc.unicamp.br
Submitted to NSF Workshop on Workflow and Process Automation

Abstract

This paper discusses some of the basic assumptions behind office work and scientific work, and shows that workflow systems for these two endeavors should offer different functionalities. In particular, we argue that data management and the management of workflow models are the most important aspects of scientific workflows.

1. Introduction

Workflow management systems (WFMS) have been marketed as systems that control the sequencing of the activities in a procedure (or workflow). At this very abstract level, WFMS could serve to control the execution of both business or office procedures and scientific procedures. Both families of procedures involve the execution of activities, some manual and some automatic, and the dependence relationships among these activities can be very complex, yielding difficult problems of synchronizing their execution.

However, this view presupposes that the process being implemented with a WFMS has been modeled in advance, and that, at enactment time, the WFMS is only following what the process model dictates. This is where the rosier view of WFMS breaks down. In real life, both in office and in scientific lab environments, the enactment of a workcase may deviate significantly from what was planned or modeled. In extreme cases, the execution of a workcase may be totally ad hoc. This has given rise to a growing research area concerned with enabling WFMS to help their users deal with such deviations from the model. To provide the correct functionalities for such a WFMS, a better understanding of what is important in office work and in scientific work is necessary.

In a recent paper, the WASA architecture was proposed in the context of scientific workflows [MedVW95] (WASA stands for "Workflow-based Architecture to support Scientific Applications"). That paper discusses properties of scientific work and the functionalities that an environment supporting scientific work must have in order to be useful. The WASA architecture is currently being used in the context of a scientific workflow in molecular biology, namely in DNA Fragment Assembly. This workflow, the resulting FAT-WASA architecture, and a prototypical implementation of the FAT-WASA system using a business workflow tool are discussed in [MeidVW95].

While [MedVW95] discusses properties of scientific work, and architectures and functionalities for these systems, this paper concentrates on scientific workflow management and its relations to business workflows and ad hoc workflows. It is organized as follows. Section 2 discusses basic properties of workflows in an office environment. Section 3 reviews the basic properties of scientific workflows and relates them to properties of business workflows. Another kind of workflow discussed in the literature is the ad hoc workflow; its relationship to scientific workflows is discussed in Section 4. Concluding remarks complete the paper.

2. Office Work

We propose that office work is mostly about achieving goals defined by the rules of an enterprise. In more modern WFMS, there has been considerable emphasis on exception handling and ad hoc planning for special cases [BK95][BW95][BN95][SMM+94]. These two concepts show that it is acceptable to deviate from the planned process in order to achieve the goals that were behind the process itself. For example, if the CEO of a company sends a memo stating that a particular customer's purchase should proceed as urgently as possible, the workflow for that purchase will be changed accordingly. In another example, if a customer is late in providing some documents, the credit checking activity may be postponed and the production scheduling activity may start before credit evaluation, although credit evaluation should normally precede it, provided this change is approved by someone at the appropriate level in the organization.

These examples show that in office work what is important is not to follow the rules but "to get the job done" or, even better, to achieve what doing the job would achieve, possibly in a more efficient way than planned. In the business workflow literature, this is called "exception handling" or "situated planning".

3. Scientific Work

Whereas office work is about goals, scientific work is about data. Collecting, generating, and analyzing large amounts of heterogeneous data is the essence of such work, or at least of the components of scientific work that are more naturally the target of WFMS. Gathering and merging data from various experiments, generating data from a computer model, and performing statistical analyses on the data are among the activities that could profit from WFMS support.

The degree of flexibility that scientists have in their work is usually much higher than in the business domain, where business processes are usually predefined and executed in a routine fashion. Assume a scientist decides to filter a data set coming from a measuring device; even if such filtering was not planned for, it is a perfectly acceptable decision, provided the resulting data is tagged as being the output of the filtering activity.

This example shows what we believe is the most important characteristic of a scientific workflow: it serves as a way of identifying data sets. The details and parameters of a workflow should be added to the data set in order to identify the data. Thus, by accessing this identification tag on a data set, the scientist would know how the data was generated (devices, algorithms, time, place) and which data manipulation activities were performed on it.
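
The tagging idea can be illustrated with a minimal sketch. It assumes a data set is simply a sequence of numbers and that the tag is an ordered list of activity names with their parameters. All names here (TaggedData, Step, acquire, run_activity, threshold_filter, "sensor-A") are hypothetical and are not taken from WASA; time and place are left out of the tag for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Tuple


@dataclass(frozen=True)
class Step:
    """One recorded activity and the parameters it was run with."""
    activity: str
    params: Tuple[Tuple[str, Any], ...] = ()


@dataclass(frozen=True)
class TaggedData:
    """A data set plus the ordered history of activities that produced it."""
    values: Tuple[float, ...]
    history: Tuple[Step, ...] = ()


def acquire(device: str, readings: Iterable[float]) -> TaggedData:
    """Create a data set tagged with the (hypothetical) device it came from."""
    return TaggedData(tuple(readings), (Step("acquire", (("device", device),)),))


def run_activity(data: TaggedData, name: str,
                 fn: Callable[..., Iterable[float]], **params: Any) -> TaggedData:
    """Apply fn to the values and append the activity to the tag."""
    step = Step(name, tuple(sorted(params.items())))
    return TaggedData(tuple(fn(data.values, **params)), data.history + (step,))


def threshold_filter(values: Iterable[float], limit: float) -> list:
    """Illustrative unplanned filter: keep readings not above limit."""
    return [v for v in values if v <= limit]


raw = acquire("sensor-A", [1.0, 9.5, 2.0, 12.0])
filtered = run_activity(raw, "threshold_filter", threshold_filter, limit=10.0)

# The tag on the filtered data set tells the scientist how it was produced.
for step in filtered.history:
    print(step.activity, dict(step.params))
```

In this sketch the unplanned filtering step is acceptable precisely because it leaves a trace: the original data set keeps its own tag, and the filtered one carries the extra entry.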

While flexibility is a major property of scientific work, there are numerous standard procedures that can be assembled to perform complex scientific experiments. An important difference from business workflows is that a scientific workflow is often not completely defined before it starts. The scientist performs some tasks and decides on the further steps only after evaluating the previous ones. These sequences of steps that make up part of a scientific experiment are known as partial workflows. Partial workflows may be re-used in later experiments. Managing partial workflows is therefore an essential goal in scientific workflow management.
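
As a rough sketch of how partial workflows might be stored and re-used, assume a partial workflow is an ordered list of labeled activities that can be extended one step at a time without altering the original. The class PartialWorkflow and the activities below are hypothetical illustrations, not part of any existing system.

```python
from typing import Callable, List, Sequence, Tuple

Activity = Tuple[str, Callable[[List[float]], List[float]]]


class PartialWorkflow:
    """An incomplete sequence of activities that can be extended later."""

    def __init__(self, name: str, steps: Sequence[Activity] = ()):
        self.name = name
        self.steps = list(steps)

    def then(self, label: str, fn: Callable[[List[float]], List[float]]) -> "PartialWorkflow":
        """Return a new workflow with one more step; the original is untouched."""
        return PartialWorkflow(self.name, self.steps + [(label, fn)])

    def run(self, data: List[float]) -> Tuple[List[float], List[str]]:
        """Execute the steps in order and report which activities were run."""
        trace = []
        for label, fn in self.steps:
            data = fn(data)
            trace.append(label)
        return data, trace


def drop_negatives(xs: List[float]) -> List[float]:
    return [x for x in xs if x >= 0]


def normalize(xs: List[float]) -> List[float]:
    peak = max(xs)
    return [x / peak for x in xs]


# A standard cleaning procedure, defined once.
cleaning = PartialWorkflow("cleaning").then("drop_negatives", drop_negatives)

# Later steps are chosen only after looking at earlier results,
# and the same partial workflow is re-used in two experiments.
experiment_1 = cleaning.then("normalize", normalize)
experiment_2 = cleaning.then("sort", sorted)

result_1, trace_1 = experiment_1.run([4.0, -1.0, 2.0])
result_2, trace_2 = experiment_2.run([4.0, -1.0, 2.0])
```

Returning a new object from then, instead of modifying the workflow in place, is a design choice made here so that the shared partial workflow stays available for later experiments.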

The above illustrates that workflow systems can prove invaluable in supporting activity tracking, data tagging and documentation, even for experiments performed by a single scientist. This is particularly true for scientists working on computational models; they generate large amounts of data, each data set produced by changing different parameters in the computer models, and all of it must be properly identified.

There is one other aspect that distinguishes office from scientific workflows: an office workflow must be brought to a "satisfying" end. If a customer cancels an order, that purchase case must be processed further until it reaches an acceptable end state: the production may be rescheduled, or the organization may sue the customer for expenses or for breaking the contract. In a scientific workflow, cases may be abandoned at any moment and at any step. If the scientist thinks that the data has been contaminated, the experiment may simply be stopped.

4. Ad hoc flow

We will call ad hoc flow the possibility of altering the flow of a particular workcase. Because of its particular characteristics, a workcase may have to follow a sequencing of activities different from the one planned for more standard workcases of that type.

There are different forms of ad hoc flow discussed in the recent literature: ad hoc planning is the case where a particular actor in the workflow may alter the plan of activities of the workcase [BK95][BN95][SMM+94]. This re-planning can be restricted to a particular domain in the organization: the credit checking department proposes a different plan for this workcase because it is in some way a special case, but outside the credit checking department the case will proceed as planned. Or the re-planning can affect the whole plan for the workcase.

A different form of ad hoc flow can be described as "pass the buck." In this case, the flow of the workcase is not planned; instead, at the end of each activity the actor decides to whom the workcase should be sent next. Ad hoc planning has been discussed in the literature as an important way of dealing with the specificities of a workcase. The pass-the-buck mode is discussed in [WB96].
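
The pass-the-buck mode can be sketched as a routing loop in which no plan exists and each actor names the next one. The sketch below is only an illustration under simple assumptions: a workcase is a dictionary, an actor is a function that returns the name of the next actor (or None to stop), and the actors clerk, manager and archive are invented for the example. It is not the model of [WB96].

```python
from typing import Callable, Dict, List, Optional

# An actor handles the workcase and names the next actor, or None to stop.
Actor = Callable[[dict], Optional[str]]


def route(workcase: dict, actors: Dict[str, Actor], start: str,
          max_hops: int = 20) -> List[str]:
    """Pass the workcase from actor to actor until one of them returns None."""
    path: List[str] = []
    current: Optional[str] = start
    while current is not None:
        if len(path) >= max_hops:
            raise RuntimeError("workcase exceeded max_hops without finishing")
        path.append(current)
        current = actors[current](workcase)
    return path


def clerk(case: dict) -> Optional[str]:
    # The clerk decides on the spot who should see the case next.
    return "manager" if case["amount"] > 1000 else "archive"


def manager(case: dict) -> Optional[str]:
    case["approved"] = True
    return "archive"


def archive(case: dict) -> Optional[str]:
    case["archived"] = True
    return None


actors = {"clerk": clerk, "manager": manager, "archive": archive}

large_case = {"amount": 5000}
small_case = {"amount": 50}
large_path = route(large_case, actors, start="clerk")
small_path = route(small_case, actors, start="clerk")
```

The returned path is the only record of the flow, since there was no model to compare against; the max_hops guard is an assumption added here to keep an endlessly forwarded workcase from looping.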

In scientific work, both forms of ad hoc flow seem very important: a group may replan a certain sequence of activities because of characteristics of the data, or because the scientists want to try a new data analysis procedure. Or a solitary scientist may not even replan in advance but, given the results of an activity, decide what to do next. Because the WFMS will manage the data, the scientist may find it more attractive to describe to the WFMS what activity should be performed next instead of just doing it, since in the latter case she would have to manually attach to the resulting data the information on what activity and parameters were used to generate it.

Scientific workflows should also provide another functionality for ad hoc planning, which we call rewind. A scientist may decide, after performing a sequence of data analysis activities (say high-frequency filtering and outlier removal), that a different form of data analysis should have been performed (say principal component analysis). The scientist should be able to rewind the workcase to a step previous to this data analysis sequence and from there perform the alternative data analysis procedure.

The rewind concept should not be confused with the redo concept, which is common in office and software engineering workflows. In a redo, the flow of the workcase is redirected so that a particular activity is executed again. The difference is that in a redo all data additions performed by the subsequent activities are available to the redone activity. For example, in software production workflows it is common to have loops of code/test where the code activity is redone if the test activity detects errors. The code activity has access to all the test results, and in fact depends on them. If it were rewound, the code activity would start again, from the specifications, with no data from later activities available. The rewind functionality is of course based on versioning the data produced by the activities. One would like to be able to restore the full context after (or before) some activity was performed and proceed with another course of action.

5. Conclusion

The objective of this paper was to exhibit differences between classes of workflows: scientific, office, and ad hoc ones. Our goal is to contribute to a better understanding of what targets exist for workflow management systems. The main focus of this paper, however, is on scientific workflows. We showed that these have special requirements, such as identifying and tagging data sets, versioning data sets, and allowing workflows to be rewound. Support for these requirements is currently being implemented in the WASA prototype; it is not yet available in any commercial workflow management system.

Acknowledgements

This work was partially supported by CNPq Brazil and by the German Ministry of Science and Technology (BMFT), within a bilateral cooperation on Database Technology and Knowledge-Based Systems, and by FAPESP Brazil.

References

[BK95] D.P. Bogia and S.M. Kaplan, Flexibility and Control for Dynamic Workflows in the wOrlds Environment, in Proceedings of the 1995 ACM Conference on Organizational Computing Systems (COOCS'95), N. Comstock and C.A. Ellis (eds.), pp 148-159, Milpitas, California, 1995.

[BN95] R. Blumenthal and G.J. Nutt, Supporting Unstructured Workflow Activities in the Bramble ICN System, in Proceedings of the 1995 ACM Conference on Organizational Computing Systems (COOCS'95), N. Comstock and C.A. Ellis (eds.), pp 130-137, Milpitas, California, 1995.

[BW95] P. Barthelmess and J. Wainer, Workflow Systems: a few definitions and a few suggestions, in Proceedings of the 1995 ACM Conference on Organizational Computing Systems (COOCS'95), N. Comstock and C.A. Ellis (eds.), pp 138-147, Milpitas, California, 1995.

[MedVW95] C. B. Medeiros, G. Vossen, M. Weske: WASA: A Workflow-Based Architecture to Support Scientific Database Applications (Extended Abstract). Proceedings of the 6th DEXA Conference (eds.: N. Revell, A. M. Tjoa), Springer LNCS 978, pp 574-583, London 1995.

[MeidVW95] J. Meidanis, G. Vossen, M. Weske: Using Workflow Management in DNA Sequencing. Fachbericht Angewandte Mathematik und Informatik 23/95-I, Universität Münster, 1995.

[SMM+94] K.D. Swenson, R.J. Maxwell, T. Matsumoto, B. Saghari and K. Irwin, A Business Process Environment Supporting Collaborative Planning, in CSCW'94, ACM, 1994.

[WB96] J. Wainer and P. Barthelmess, Workcase-centric Workflow Model, submitted to NSF Workshop on Workflow and Process Automation, 1996.