Stream-processing Engine
Applications that process real-time datastreams are pushing the limits of traditional data processing technologies. These applications are characterized by the need for sub-second response times——whether they involve automating trades, monitoring networks for intrusions, or tracking credit card transactions for fraud. Applications that depend on the traditional store-and-query model cannot handle the volume and velocity of streaming data, whose value might exist only in the moment.
A stream-processing engine (SPE) is data management software that enables the execution of queries and computations—— and ultimately, actions——on streaming data in real time. Previously, queries and computations could only be executed with stored data using standard database management systems. An SPE accepts SQL-like, stream-oriented, continuous queries and executes them over live event streams, outputting results in real time.
An SPE achieves real-time operation by integrating several mechanisms. First, it supports inbound processing, in which incoming event streams immediately start to flow through the continuous queries as they enter the system. The queries transform the events as they move, continuously producing results, all in main memory. Read or write operations to storage are optional and can be executed asynchronously in many cases.
Inbound processing overcomes a limitation of the traditional outbound processing model conventional database management systems employ, in which data must be inserted into the database and indexed before any processing can take place. By removing storage from the critical path of processing, an SPE achieves significant performance gains compared with traditional processing approaches.
Second, an SPE adopts a single-process model, in which all time-critical operations (including event processing, storage and execution of custom application logic) are run as part of one multi-threaded process. This integrated approach eliminates high-overhead process switches present in solutions that use multiple software systems to provide the same capabilities.
Third, an SPE provides a flexible, in-process storage model and standards-based access to external databases. In-memory hash tables are used for very fast insert and look-up operations. Embedded databases are used to ensure persistence of data and can be accessed and manipulated using SQL-style declarative queries. External, remote-process databases are accessible through standard Open Database Connectivity calls and are convenient to use when supporting legacy databases or facilitating database sharing with external applications.
An SPE has built-in filtering, aggregating and correlating, and merging operators that manipulate windows of events. Standard SQL is defined over finite-sized tables, and an execution engine thereby knows when it is finished with all its operations. In contrast, streams potentially never end, and an SPE must be instructed when to finish processing and output an answer.
The windowing construct serves this purpose by defining the scope of an operator. In a trading application, a one-hour window can be used to express a stream-oriented query that calculates an hourly volume-weighted average price. Windows are user-configurable and can be defined over time, number of events or breakpoints in other attributes of an event.
Stream-oriented operators provide resiliency to imperfections in datastreams, caused by out-of-order or delayed data arrivals, both of which occur frequently in real-world scenarios. Resiliency is achieved by making operators time-sensitive: Optionally, an operator can be told to wait a longer period of time for out-of-order messages, or timeout and stop waiting for late messages that might never arrive.
Finally, an SPE supports distributed operation for improved scalability and availability. Incremental scalability is achieved by letting processing be partitioned and distributed across multiple machines transparently, without necessitating any changes in the application. High availability is crucial to preserve the integrity of applications and to avoid disruptions in real-time processing.
流处理引擎
处理实时数据流的应用程序正在将传统的数据处理技术推到极限。这些应用程序以亚秒的响应时间为特征的——不管它们是涉及到贸易自动化,为防入侵而监视网络,还是为防诈骗而跟踪信用卡交易。那些依靠传统的存储-查询模型的应用程序已不能满足流数据的量与速度方向的要求,而流数据的价值可能只存在于瞬间之间。
流处理引擎(SPE)是一种数据管理软件,能实时地对流数据实现查询与计算、以及最终(应采取的)动作。过去,只能对利用标准数据库管理系统存储的数据执行查询和计算,而SPE接收类似SQL、面向流的连续查询,并执行正在发生的事件流,实时地输出结果。
SPE是通过将几种机理整合在一起实现实时操作的。首先,支持入处理,即输入的事件流一进入系统就马上开始流经连续的查询。在它们流动时,查询变换事件,连续地给出结果,所有这一切都是在内存中进行的。对磁盘存储的读或写操作是可选的,在很多情况下是被异步处理的。
入处理克服了常规数据库管理系统使用的传统出处理的局限,在出处理中,数据必须插入数据库,并在开始任何处理之前建立索引。通过将磁盘存储排除在处理的关键路径之外,与传统的处理方法相比,SPE获得了明显的性能提高。
第二,SPE采用了单处理模型,其中所有与时间密切相关的操作(包括事件处理、定制的应用逻辑的存储和执行)是作为一个多线索进程的一部分运行的。这种整合的方法消除了进程转换的高开销,在使用多个软件系统来提供同样功能的解决方案中就存在着这种进程转换。
第三,SPE提供了一个灵活的进程间存储模型和基于标准的对外部数据库的访问。内存中散列表用于极快的插入和查找操作。嵌入的数据库用于确保数据的一致性,以及能利用SQL风格的描述性查询进行的访问和操纵。外部的、远程进程数据库通过标准的“开放数据库互连”调用进行访问,当要支持过时的数据库时,这种数据库用起来很方便,能方便地实现数据库与外部应用程序的共享。
SPE拥有内在的过滤、聚合和相关、以及合并操作符,它们操纵事件的窗口。标准SQL定义在有限大小的表格之上,从而执行引擎知道何时完成了所有的操作。相反,流存在着永不结束的潜在可能,在结束处理和输出答案时SPE必须要有指令。
通过定义操作符的范围,窗口构建为此目的服务。在传统的应用程序中,一小时的窗口可以用来表达计算以小时为量加权的面向流的查询。窗口是用户可以配置的,可以定义在时间、事件数量或者一个事件中其他属性的断开点上。
面向流的操作符对数据流中因次序破坏或数据达到的延误造成的破坏提供了弹性,而这两种情况在现实世界中是经常发生的。弹性是通过使操作符对时间敏感而获得的。操作符可以有选择地被告知,对失序的信息等待更长一些时间,或者规定的时间用完不再等待可能永远不会到来的过时信息。
最后,SPE支持改进可扩性和可用性的分布式操作。增强可扩性是通过让处理分割并透明地分布到多个机器上实现的,不必修改应用程序。高可用性对保留应用程序的完整性是至关重要的,可避免实时处理的中断。