Publication
A Workflow Approach to Stream Processing
Biörn Biörnstad
ETH Zürich, Diss. Nr. 79600, January 2008
Supervised by: Prof. Gustavo Alonso
Today, workflow languages are widely used for service composition. Workflow and
business process management (BPM) systems are typically based on a step-by-step
execution model where a task is started, the result is received, and then the next
task is scheduled for execution in a similar fashion. To track the execution of individual
service invocations and of the overall workflow process, a state-machine-based
approach is used. This model corresponds to the request-response nature of
many service interfaces and maps directly to technologies such as Web services or
business process modeling specifications such as WS-BPEL. However, there are services
which do not follow this interaction pattern but rather proactively produce
new information to be consumed by an application. Examples include RSS feeds
listing the latest bids at an auction, result tuples from a data stream management
system (DSMS), stock price tickers or a tag stream from an RFID reader.
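To make the step-by-step model concrete, the following is a minimal sketch in Java (our illustration; the class, enum, and method names are hypothetical and not taken from any particular engine) of the state-machine lifecycle a workflow engine might keep per task invocation:

    // Illustrative only: a per-invocation task lifecycle as a state machine.
    public class TaskLifecycleSketch {
        // States tracked by the engine; invoke() below exercises a subset.
        enum TaskState { SCHEDULED, RUNNING, FINISHED, FAILED }

        // One step-by-step request-response invocation: start the task, wait
        // for its single result, then let the engine schedule the successor.
        static TaskState invoke(Runnable service) {
            try {
                service.run();              // blocking request-response call
                return TaskState.FINISHED;
            } catch (RuntimeException e) {
                return TaskState.FAILED;
            }
        }
    }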
This dissertation shows how to extend traditional state-based workflow management
techniques with the necessary features to integrate streaming data services
and combine them with conventional request-response services. First, we study the
problem of accessing a stream from within a workflow process. We discuss different
alternatives in terms of expressiveness and performance. One approach involves the
extension of the traditional request-response task model. We show how to accomplish
this on the level of the workflow language and describe the necessary changes
in a workflow engine. Our solution provides a notion of stream which abstracts from
the mechanism or protocol used to access the content of the stream. This makes
the service composition independent of what type of stream is processed, e.g., RSS
items, RFID tags or database tuples.
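As an illustration of such an abstraction (a sketch under our own naming, not the interface actually defined in the thesis), a stream could be exposed to the engine as a source-agnostic interface:

    import java.util.function.Consumer;

    // Hypothetical protocol-agnostic stream abstraction: the engine sees only
    // elements of type T, regardless of whether they originate from RSS
    // polling, an RFID reader, or a DSMS result cursor.
    interface Stream<T> {
        // Register a callback invoked once per element the source produces.
        void onElement(Consumer<T> consumer);
        // Stop consuming the underlying source.
        void close();
    }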
The invocation of a streaming service leads to a stream of result elements independently
flowing through the invoking process, thereby creating a pipelining effect.
This leads to safety problems in the execution of the process, as state-based workflow
execution models are not designed for such parallel processing. For example, because
the tasks of a process do not all take the same time to execute, a task might not
be ready to process a stream element when the element is offered to it.
This can lead to loss of data in the stream processing pipeline. We discuss different
solutions to avoid the safety problems connected with pipelined processing and
identify the minimal necessary extension to the semantics of a workflow language.
This extension is based on a flow-control mechanism that regulates the flow of data
between tasks in a process and allows the safe use of pipelining in the execution of
a process.
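One way to picture this flow-control semantics (a sketch using Java's java.util.concurrent, under our own assumptions rather than the engine's actual implementation) is a zero-capacity handoff between two tasks: the upstream task blocks until the downstream task is ready, so no stream element can be dropped:

    import java.util.concurrent.SynchronousQueue;

    public class FlowControlSketch {
        public static void main(String[] args) throws InterruptedException {
            // Zero-capacity handoff: put() blocks until the consumer take()s,
            // so an element is never offered to a task that is not ready.
            SynchronousQueue<String> handoff = new SynchronousQueue<>();

            Thread upstream = new Thread(() -> {
                try {
                    for (int i = 0; i < 5; i++)
                        handoff.put("element-" + i); // blocks while downstream is busy
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread downstream = new Thread(() -> {
                try {
                    for (int i = 0; i < 5; i++) {
                        String element = handoff.take();
                        Thread.sleep(50);            // a task with nontrivial duration
                        System.out.println("processed " + element);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            upstream.start(); downstream.start();
            upstream.join(); downstream.join();
        }
    }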
Our extended semantics for pipelining will block a task when it is not safe to
execute it. However, if the services that are composed into a stream processing
pipeline show variations in their response time, this will decrease the throughput
of the pipeline, as our measurements show. Therefore, based on the flow-control
semantics, we show how to use buffered data transfer between tasks in a pipelined
process. The buffers smooth irregularities in task duration and allow a
pipeline to run at its maximum possible throughput.
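Continuing the sketch above (again illustrative, with an arbitrarily chosen capacity), replacing the zero-capacity handoff with a small bounded buffer preserves flow control, because put() still blocks when the buffer is full, while letting a fast task run ahead of a momentarily slow one:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BufferedPipelineSketch {
        public static void main(String[] args) throws InterruptedException {
            // Bounded buffer between two tasks: put() blocks only when the
            // buffer is full, take() only when it is empty, so short-term
            // variations in task duration no longer stall the pipeline.
            BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(8);

            Thread upstream = new Thread(() -> {
                try {
                    for (int i = 0; i < 100; i++) buffer.put(i);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread downstream = new Thread(() -> {
                try {
                    for (int i = 0; i < 100; i++) {
                        int element = buffer.take();
                        Thread.sleep(element % 10 == 0 ? 20 : 1); // irregular duration
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            upstream.start(); downstream.start();
            upstream.join(); downstream.join();
        }
    }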
Finally, to evaluate our approach, the stream processing extensions proposed in
this thesis have been implemented in an existing workflow system. In addition to
describing the implementation in detail, we present several performance measurements
and an application built on top of the extended system. The application is a Web
mashup that integrates data from a live Web server log with a geolocation
database in order to provide a real-time view of the visitors to a Web site together
with their geographic locations on a map.
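The core of such a mashup pipeline can be sketched in a few lines (the file name and the geolocation stub are hypothetical; the thesis does not prescribe this code, and a real deployment would tail the log continuously rather than read it once): read the access log as a stream, extract the client address from each entry, and look it up in a geolocation database:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class LogGeoMashupSketch {
        // Stub standing in for the geolocation database lookup; the real
        // mashup would query an IP-to-location database here.
        static String geolocate(String ip) {
            return "location(" + ip + ")";
        }

        public static void main(String[] args) throws IOException {
            // Each log line is one stream element flowing through the pipeline.
            try (BufferedReader log = new BufferedReader(new FileReader("access.log"))) {
                String line;
                while ((line = log.readLine()) != null) {
                    String ip = line.split(" ")[0];   // client address field
                    System.out.println(ip + " -> " + geolocate(ip));
                }
            }
        }
    }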
@phdthesis{abc,
  author   = {Bi{\"o}rn Bi{\"o}rnstad},
  title    = {A Workflow Approach to Stream Processing},
  school   = {ETH Z{\"u}rich},
  year     = {2008},
  note     = {Diss. Nr. 79600},
  abstract = {Today, workflow languages are widely used for service composition. Workflow and business process management (BPM) systems are typically based on a step-by-step execution model where a task is started, the result is received, and then the next task is scheduled for execution in a similar fashion. To track the execution of individual service invocations and of the overall workflow process, a state-machine-based approach is used. This model corresponds to the request-response nature of many service interfaces and maps directly to technologies such as Web services or business process modeling specifications such as WS-BPEL. However, there are services which do not follow this interaction pattern but rather proactively produce new information to be consumed by an application. Examples include RSS feeds listing the latest bids at an auction, result tuples from a data stream management system (DSMS), stock price tickers or a tag stream from an RFID reader. This dissertation shows how to extend traditional state-based workflow management techniques with the necessary features to integrate streaming data services and combine them with conventional request-response services. First, we study the problem of accessing a stream from within a workflow process. We discuss different alternatives in terms of expressiveness and performance. One approach involves the extension of the traditional request-response task model. We show how to accomplish this on the level of the workflow language and describe the necessary changes in a workflow engine. Our solution provides a notion of stream which abstracts from the mechanism or protocol used to access the content of the stream. This makes the service composition independent of what type of stream is processed, e.g., RSS items, RFID tags or database tuples. The invocation of a streaming service leads to a stream of result elements independently flowing through the invoking process, thereby creating a pipelining effect. This leads to safety problems in the execution of the process, as state-based workflow execution models are not designed for such parallel processing. For example, because the tasks of a process do not all take the same time to execute, a task might not be ready to process a stream element when the element is offered to it. This can lead to loss of data in the stream processing pipeline. We discuss different solutions to avoid the safety problems connected with pipelined processing and identify the minimal necessary extension to the semantics of a workflow language. This extension is based on a flow-control mechanism that regulates the flow of data between tasks in a process and allows the safe use of pipelining in the execution of a process. Our extended semantics for pipelining will block a task when it is not safe to execute it. However, if the services that are composed into a stream processing pipeline show variations in their response time, this will decrease the throughput of the pipeline, as our measurements show. Therefore, based on the flow-control semantics, we show how to use buffered data transfer between tasks in a pipelined process. The buffers smooth irregularities in task duration and allow a pipeline to run at its maximum possible throughput. Finally, to evaluate our approach, the stream processing extensions proposed in this thesis have been implemented in an existing workflow system. In addition to describing the implementation in detail, we present several performance measurements and an application built on top of the extended system. The application is a Web mashup that integrates data from a live Web server log with a geolocation database in order to provide a real-time view of the visitors to a Web site together with their geographic locations on a map.}
}