Publication
ETH Zürich, Diss. Nr. 79600, January 2008
Supervised by: Prof. Gustavo Alonso
Today, workflow languages are widely used for service composition. Workflow and
business process management (BPM) systems are typically based on a step-by-step
execution model where a task is started, the result is received, and then the next
task is scheduled for execution in a similar fashion. To track the execution of individual
service invocations and of the overall workflow process, a state-machine-based
approach is used. This model corresponds to the request-response nature of
many service interfaces and maps directly to technologies such as Web services or
business process modeling specifications such as WS-BPEL. However, some services
do not follow this interaction pattern and instead proactively produce
new information to be consumed by an application. Examples include RSS feeds
listing the latest bids at an auction, result tuples from a data stream management
system (DSMS), stock price tickers or a tag stream from an RFID reader.
This dissertation shows how to extend traditional state-based workflow management
techniques with the necessary features to integrate streaming data services
and combine them with conventional request-response services. First, we study the
problem of accessing a stream from within a workflow process. We discuss different
alternatives in terms of expressiveness and performance. One approach involves the
extension of the traditional request-response task model. We show how to accomplish
this on the level of the workflow language and describe the necessary changes
in a workflow engine. Our solution provides a notion of a stream that abstracts from
the mechanism or protocol used to access the content of the stream. This makes
the service composition independent of the type of stream being processed, e.g., RSS
items, RFID tags, or database tuples.
The invocation of a streaming service leads to a stream of result elements independently
flowing through the invoking process, thereby creating a pipelining effect.
Pipelining leads to safety problems in the execution of the process, as state-based workflow
execution models are not designed for such parallel processing. For example, because
the tasks of a process do not always take the same time to execute, a task might
not be ready to process a stream element when the element is offered to it.
This can lead to loss of data in the stream processing pipeline. We discuss different
solutions to avoid the safety problems connected with pipelined processing and
identify the minimal necessary extension to the semantics of a workflow language.
This extension is based on a flow control mechanism which controls the flow of data
between tasks in a process and allows the safe use of pipelining in the execution of
a process.
Our extended semantics for pipelining blocks a task when it is not safe to
execute it. However, if the services that are composed into a stream processing
pipeline show variations in their response time, this blocking decreases the throughput
of the pipeline, as our measurements show. Therefore, based on the flow control
semantics, we show how to use a buffered data transfer between tasks in a pipelined
process. The buffers will smooth the irregularities in the task duration and allow a
pipeline to run at its maximum possible throughput.
Finally, to evaluate our approach, the stream processing extensions proposed in
this thesis have been implemented in an existing workflow system. Apart from describing
the implementation in detail, we present several performance measurements
and an application built on top of the extended system. The application is a Web
mashup which integrates the data from a live Web server log with a geolocation
database in order to provide a real-time view of the visitors to a Web site together
with their geographic locations on a map.
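
The flow control and buffering ideas summarized above can be illustrated with a minimal sketch. It is not the workflow engine extension described in the thesis: it assumes each task runs in its own thread and that tasks are connected by bounded in-memory queues. The names `run_pipeline`, `stages`, and `capacity` are hypothetical and chosen only for this illustration. A full queue blocks the upstream task, so no stream element is lost, and a larger `capacity` lets the pipeline absorb variations in task duration.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(elements, stages, capacity=1):
    """Send each stream element through `stages` (one-argument functions).

    Each stage runs in its own thread. Neighbouring stages are connected by
    a bounded queue holding at most `capacity` elements (hypothetical
    parameter standing in for the buffer size between tasks).
    """
    queues = [queue.Queue(maxsize=capacity) for _ in range(len(stages) + 1)]
    results = []

    def worker(task, inbox, outbox):
        while True:
            item = inbox.get()        # wait until an element is offered
            if item is _DONE:
                outbox.put(_DONE)     # forward end-of-stream downstream
                return
            outbox.put(task(item))    # blocks while the next buffer is full

    def collect(inbox):
        while True:
            item = inbox.get()
            if item is _DONE:
                return
            results.append(item)

    threads = [
        threading.Thread(target=worker, args=(task, queues[i], queues[i + 1]))
        for i, task in enumerate(stages)
    ]
    threads.append(threading.Thread(target=collect, args=(queues[-1],)))
    for t in threads:
        t.start()

    for element in elements:
        queues[0].put(element)        # the source blocks if the first buffer is full
    queues[0].put(_DONE)

    for t in threads:
        t.join()
    return results
```

With `capacity=1` a producer waits as soon as one element is pending for a slow consumer, which corresponds to blocking flow control; raising `capacity` corresponds to buffered transfer between tasks. In both cases every element arrives downstream in its original order.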
@phdthesis{abc,
abstract = {Today, workflow languages are widely used for service composition. Workflow and
business process management (BPM) systems are typically based on a step-by-step
execution model where a task is started, the result is received, and then the next
task is scheduled for execution in a similar fashion. To track the execution of individual
service invocations and of the overall workflow process, a state-machine-based
approach is used. This model corresponds to the request-response nature of
many service interfaces and maps directly to technologies such as Web services or
business process modeling specifications such as WS-BPEL. However, some services
do not follow this interaction pattern and instead proactively produce
new information to be consumed by an application. Examples include RSS feeds
listing the latest bids at an auction, result tuples from a data stream management
system (DSMS), stock price tickers or a tag stream from an RFID reader.
This dissertation shows how to extend traditional state-based workflow management
techniques with the necessary features to integrate streaming data services
and combine them with conventional request-response services. First, we study the
problem of accessing a stream from within a workflow process. We discuss different
alternatives in terms of expressiveness and performance. One approach involves the
extension of the traditional request-response task model. We show how to accomplish
this on the level of the workflow language and describe the necessary changes
in a workflow engine. Our solution provides a notion of a stream that abstracts from
the mechanism or protocol used to access the content of the stream. This makes
the service composition independent of the type of stream being processed, e.g., RSS
items, RFID tags, or database tuples.
The invocation of a streaming service leads to a stream of result elements independently
flowing through the invoking process, thereby creating a pipelining effect.
Pipelining leads to safety problems in the execution of the process, as state-based workflow
execution models are not designed for such parallel processing. For example, because
the tasks of a process do not always take the same time to execute, a task might
not be ready to process a stream element when the element is offered to it.
This can lead to loss of data in the stream processing pipeline. We discuss different
solutions to avoid the safety problems connected with pipelined processing and
identify the minimal necessary extension to the semantics of a workflow language.
This extension is based on a flow control mechanism which controls the flow of data
between tasks in a process and allows the safe use of pipelining in the execution of
a process.
Our extended semantics for pipelining blocks a task when it is not safe to
execute it. However, if the services that are composed into a stream processing
pipeline show variations in their response time, this blocking decreases the throughput
of the pipeline, as our measurements show. Therefore, based on the flow control
semantics, we show how to use a buffered data transfer between tasks in a pipelined
process. The buffers will smooth the irregularities in the task duration and allow a
pipeline to run at its maximum possible throughput.
Finally, to evaluate our approach, the stream processing extensions proposed in
this thesis have been implemented in an existing workflow system. Apart from describing
the implementation in detail, we present several performance measurements
and an application built on top of the extended system. The application is a Web
mashup which integrates the data from a live Web server log with a geolocation
database in order to provide a real-time view of the visitors to a Web site together
with their geographic locations on a map.},
author = {Bi{\"o}rn Bi{\"o}rnstad},
month = jan,
note = {Diss. Nr. 79600},
school = {ETH Z{\"u}rich},
title = {A Workflow Approach to Stream Processing},
year = {2008}
}