Pervasive Software’s Datarush

Pervasive Software (NASDAQ: PVSW) is a company that many of us may have come across in the past. They have been around for some twenty-five years in the database and search businesses, going back to SoftCraft in the early eighties and then Btrieve with its eponymous ISAM product, subsequently becoming Pervasive Software. The company has been quoted for about a decade. Our interest in them arises because their Datarush product is aimed at heavy searching and data analysis on large databases running on multi-core systems.

The rate at which we can capture data from commercial, financial, telecommunications, banking and a raft of other industries is growing explosively. At the same time the need for fast, deep analysis becomes ever more acute. The only way to address this in future will be through the deployment of large multiprocessor, multi-core systems.

Datarush addresses an important issue: the gap between what hardware vendors are promising to deliver by way of multi-core processor (MCP) hardware (Xeon, Opteron and so on) and what the software industry is able to deliver in terms of commercial applications to exploit that hardware. Datarush is not a silver bullet as far as parallel programming goes, and Pervasive openly state this. What it does do is provide a set of tools targeted at what this particular sector is going to require, at least in the near term. Requirements will grow and change as people realise what parallel computing can achieve.

Datarush is targeted at applications that need to process very large datasets (sets with several hundred million records have been trialled to date) and that are analytical rather than straightforward, embarrassingly parallel transaction-based systems. In this respect Pervasive are very clear that their present trial applications are showing very real improvements in both throughput and analytic capability.

Datarush is a framework that sits atop a JVM running on hardware from vendors such as Sun, IBM, HP, Dell, Azul and SGI. The framework is therefore operating system agnostic, running on Linux, Solaris, AIX, HP-UX and Windows Server. Of course the efficient use of multicore hardware by Datarush depends on how well the operating system and the JVM implementation manage the hardware. Not all platforms are equal. Datarush provides a series of modules that offer functions such as sorting and collation, ETL, extraction and profiling in a form that is readily applied by the user. Underneath, the Datarush model is essentially a thread-based approach at the functional level. That is, Datarush uses a thread pool to implement the parallelism required by the algorithms created using its framework.
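To make the thread-pool idea concrete, here is a minimal sketch in plain Java using only the standard java.util.concurrent classes. It is not Datarush code and does not use any Datarush API; the class name PoolSketch and the method parallelSum are invented for illustration. It assumes a simple analytic step (summing a field across records) that can be split into independent chunks, one per worker thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: not part of any Pervasive product.
public class PoolSketch {

    // Sums the records by splitting them into contiguous chunks
    // and submitting one task per chunk to a fixed-size thread pool.
    static long parallelSum(long[] records, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Round up so that every record falls into some chunk.
            int chunk = (records.length + workers - 1) / workers;
            List<Future<Long>> parts = new ArrayList<>();
            for (int start = 0; start < records.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, records.length);
                parts.add(pool.submit(() -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += records[i];
                    }
                    return sum;
                }));
            }
            // Combine the partial results; get() blocks until each task is done.
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] records = new long[1_000_000];
        for (int i = 0; i < records.length; i++) {
            records[i] = i;
        }
        // One worker per core reported by the JVM.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(parallelSum(records, cores));
    }
}
```

Even in this small case the programmer has to think about chunk boundaries, task submission and result collection. That is the kind of plumbing a framework of this sort aims to hide.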

Interestingly, Pervasive has chosen a functional, dataflow programming model for Datarush to provide the necessary mapping between the application and the underlying hardware. Programmers use one of the predefined functions, or some combination of them, to build their process and execute it. By abstracting in this way, applications developers do not have to worry about low-level issues such as shared memory, synchronization and locking. Languages such as occam and Erlang have long taken a similar approach to managing parallelism, one that is increasingly viewed by many as more appropriate for programmers to work with. Freed from the need to consider issues such as synchronization, applications developers and users can focus on the data itself and the algorithms for its analysis. Greater performance should then be gained without the highly specialist effort and time that lower-level approaches require. Pervasive has produced a series of claims in which performance increases rapidly with the number of processors applied to the problem. We have not verified these figures independently, but they seem to be consistent with the model that Pervasive proposes.
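The dataflow style can be sketched in a few lines of plain Java. Again, this is our own illustration under stated assumptions rather than Datarush code: the class DataflowSketch and the operators source, filter and map are hypothetical names, and the queues and end-of-stream marker are one possible design among many. Each operator runs in its own thread and communicates only through bounded queues, so the code that composes the pipeline contains no locks or shared mutable state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical illustration only: not part of any Pervasive product.
public class DataflowSketch {

    // Emits every record, then an empty Optional as the end-of-stream marker.
    static <A> Runnable source(List<A> records, BlockingQueue<Optional<A>> out) {
        return () -> {
            try {
                for (A rec : records) {
                    out.put(Optional.of(rec));
                }
                out.put(Optional.empty());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    // Passes on only the records that satisfy the predicate.
    static <A> Runnable filter(BlockingQueue<Optional<A>> in,
                               BlockingQueue<Optional<A>> out,
                               Predicate<A> keep) {
        return () -> {
            try {
                for (Optional<A> rec = in.take(); rec.isPresent(); rec = in.take()) {
                    if (keep.test(rec.get())) {
                        out.put(rec);
                    }
                }
                out.put(Optional.empty());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    // Applies fn to every record flowing from in to out.
    static <A, B> Runnable map(BlockingQueue<Optional<A>> in,
                               BlockingQueue<Optional<B>> out,
                               Function<A, B> fn) {
        return () -> {
            try {
                for (Optional<A> rec = in.take(); rec.isPresent(); rec = in.take()) {
                    out.put(Optional.of(fn.apply(rec.get())));
                }
                out.put(Optional.empty());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    // Wires source -> filter (even numbers) -> map (square) and collects the output.
    static List<Integer> run(List<Integer> input) throws InterruptedException {
        BlockingQueue<Optional<Integer>> raw = new ArrayBlockingQueue<>(64);
        BlockingQueue<Optional<Integer>> evens = new ArrayBlockingQueue<>(64);
        BlockingQueue<Optional<Integer>> squares = new ArrayBlockingQueue<>(64);

        // One thread per operator, so every stage can run concurrently.
        ExecutorService pool = Executors.newFixedThreadPool(3);
        pool.execute(source(input, raw));
        pool.execute(filter(raw, evens, n -> n % 2 == 0));
        pool.execute(map(evens, squares, n -> n * n));
        pool.shutdown();

        List<Integer> results = new ArrayList<>();
        for (Optional<Integer> rec = squares.take(); rec.isPresent(); rec = squares.take()) {
            results.add(rec.get());
        }
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(Arrays.asList(1, 2, 3, 4, 5, 6)));
    }
}
```

The point of the sketch is the shape of run: the application is described as a graph of operators, and the synchronization lives entirely inside the queues. A production framework would add partitioning, error propagation and scheduling across many cores, none of which is attempted here.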

Pervasive have trialled Datarush on a wide variety of systems, from Azul’s 384-processor engine to AMD’s Barcelona processor, using a range of JVM implementations. We haven’t yet seen the detailed results of these trials, but we expect them to be very good. Anyone with large-scale data-analysis requirements would do well to “watch this space”.

As with any data-oriented system, it is actually the balance between compute power, memory access times and I/O that determines performance. For a well-balanced system and many classes of database problems, it seems that Datarush has a viable approach for most users. One would expect that, with a little trialling, most large-scale analysis applications ought to be able to exploit a lot of the advantage of MCPs using a system like Datarush.

Others active in the search-space arena have looked at parallel processor-based databases previously. In the 1980s and 1990s many companies, from Oracle down, experimented with implementations that could use the power of multi-processor arrays. Many of the lessons learnt made their way into production software and have been in circulation for a number of years. These lessons will in some respects translate to multi-core under a threads-based model, too. What remains to be addressed in the longer term is what happens when hardware goes beyond the currently limited multicore systems; in other words, what happens when the threads model starts to reach the limit of its usefulness. Techniques that work well on a limited number of virtually independent processors (multi-processor mainframes, for instance) don’t necessarily translate to high core-count MCP engines. The underlying technologies hint that the approaches developed to underpin the Datarush framework may well prove portable.

Datarush is at present in beta-2 and will go to beta-3 later in the year.