The Concertant Blog
Sat, 26 Feb 2011
by Peter Dzwig
A perennial problem has been how to benchmark clusters – and if benchmarked what indeed does it
mean. Oh and can I do it cheaply and easily and without any preconceived notions of what the system should
Well, we now have the beginnings of an answer, to which people are invited to
contribute. Raul Gomez has announced the beginnings of a
Sourceforge project to create such a product, based upon the results of his thesis project. The project
is called ClusterNumbers. It must be
made clear that ClusterNumbers isn't unique in the world, but it is open source and based upon some sound
starting points. See, for example, those in The HPC Challenge of
Luszczek, Dongarra et al.
The idea is that any user with a modest level of skill ought to be able to benchmark their system and so
get a grip on how their distributed application should run on a particular cluster. The
tool should therefore address the major issues affecting cluster performance and make them
readily accessible via a single user interface, offer the user a basic configuration for carrying
out the benchmark, and identify factors impacting performance.
The core of ClusterNumbers is a set of packaged benchmarks for CPU, memory, networking, etc., that can be
accessed either individually or as a whole. These are HPL (High Performance Linpack) across the cluster;
DGEMM providing matrix multiplies on a cluster node; FFTE tests for CPU execution rates by running discrete
FFTs; STREAM to measure CPU/memory performance; disk performance is measured using IOTRANS; and network
capability under various loading is measured using Netperf and PTRANS. The observant reader will have
noted that many of these are scientifically-oriented benchmarks and the majority are those used in The HPC
Challenge; nonetheless here the aims are somewhat different and they are being supplemented by others.
ClusterNumbers allows the user to select the kind of benchmarks to be run from a PC window that
communicates with a daemon that runs on the cluster's admin node. The selection is then made to run the
appropriate subsets of the benchmarks listed above.
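To make the "select a subset and run it" idea concrete, here is a minimal sketch in Python. It is not ClusterNumbers code: the registry, the function names and the two toy kernels (a naive matrix multiply standing in for DGEMM, and a triad loop standing in for STREAM) are all assumptions made for illustration, and pure-Python timings say nothing about real hardware rates.

```python
import time

def dgemm_like(n=60):
    """Naive n x n matrix multiply (hypothetical stand-in for DGEMM).
    Returns a nominal rate in MFLOP/s, counting about 2*n**3 operations."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    c = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    elapsed = time.perf_counter() - start
    assert c[0][0] == 2.0 * n  # sanity check on the result
    return 2 * n**3 / elapsed / 1e6

def stream_triad_like(n=200_000):
    """Triad a[i] = b[i] + s * c[i] (hypothetical stand-in for STREAM).
    Returns a nominal MB/s figure, assuming 8 bytes per value and
    three arrays touched; Python lists do not really pack values that way."""
    b = [1.0] * n
    c = [2.0] * n
    s = 3.0
    start = time.perf_counter()
    a = [bi + s * ci for bi, ci in zip(b, c)]
    elapsed = time.perf_counter() - start
    assert a[0] == 7.0
    return 3 * 8 * n / elapsed / 1e6

# Hypothetical registry: benchmark name -> callable returning one number.
BENCHMARKS = {"dgemm": dgemm_like, "stream": stream_triad_like}

def run_selected(names):
    """Run only the requested benchmarks and return {name: result}."""
    unknown = [name for name in names if name not in BENCHMARKS]
    if unknown:
        raise ValueError(f"unknown benchmarks: {unknown}")
    return {name: BENCHMARKS[name]() for name in names}

if __name__ == "__main__":
    for name, value in run_selected(["dgemm", "stream"]).items():
        print(f"{name}: {value:.1f}")
```

In a tool of the kind described, the selection would presumably arrive from the PC window and the daemon on the admin node would do the equivalent of `run_selected`; that wiring is left out here.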
Getting the members of the FOSS community involved through an Open Source project in Sourceforge indeed
seems a logical step and according to Gomez there has already been a strong response from the HPC
community. And therein lies an issue. While without doubt the high performance community lead a lot of
fields and stress systems in ways that make their contributions to projects such as these invaluable, it
is important that those who run clusters in other environments contribute. Their presence would give
ClusterNumbers a broad following and ensure that Gomez' work is not “just another Sourceforge
project”. After all some of the most intensive users of clusters in the world are very much not
conventional HPC users although their systems are certainly high-throughput. Perhaps they should include
something like DBT2 or an appropriate derivative as a starter.
The first steps are to create a roadmap. The wider the input at that stage, the better for the long term
viability of the technology. I urge those outside the HPC community to make their input in the interests
of giving ClusterNumbers a wider user base than might otherwise be the case.
Sat, 05 Feb 2011
by Russel Winder
Till recently there was an implicit understanding throughout the various computer-based industries that
data meant relational database meant SQL. A lot of data fits very nicely into the relational model, but
not all does. Many queries on data (held in relational databases) are easily expressed using SQL, but
not all are. Trying to force all data and queries into the relational/SQL model leads to a lot of
problems in some cases. Hence the NoSQL movement.
The likes of Cassandra, MongoDB, CouchDB, etc. are taking the world by storm since they provide
non-relational data storage and querying facilities that work much better than the relational ones for
some problems. There is a potential for the pendulum to swing too far, of course, and for problems well
handled by relational frameworks to be forced into NoSQL frameworks. This would be a shame.
The future is clearly heading towards a “mixed economy” of relational/SQL and NoSQL. A key
decision for storage-based applications – which increasingly means every application there is
– will be whether to be relational or not. An issue to be solved by the analyst/designers on a case
by case basis. However there is another issue and that is analysis and processing of data. It remains
the case, at least until NoSQL systems have much greater impact on analytics generally, that doing
analytics and data mining is seen as an application that employs SQL to access data. Fortunately, there
is a third way.
The core of the current problem for most analytics activity is that any application based on SQL queries
cannot harness the multicore and cluster hardware architectures that are now the norm. Analytics
applications cannot therefore make use of all the parallelism available. Big problems can therefore not
only take hours, days, even weeks, but bringing in new hardware with more cores and/or more cluster nodes
will make little or no difference to the execution time of the application. Given that the future is one
of massively increasing amounts of data, analytics appears to be in deep trouble. What's the solution?
Change the paradigm. Enter Pervasive DataRush.
Pervasive DataRush is a software framework that implements a
dataflow architecture. A dataflow-structured application is written as a collection of operators
connected together by uni-directional channels (things down which data can flow). An operator can have
many input channels and many output channels. An input channel can be the output channel of another
operator or come from a data source. An output channel either connects to another operator as one of its
input channels or it goes to a data sink. An operator is an event driven process, where the event that
triggers execution is some state of the input channels.
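The operator-and-channel structure described above can be sketched in a few lines of Python. This is emphatically not the Pervasive DataRush API: the names `operator`, `run_pipeline` and `DONE` are invented for illustration, channels are modelled with `queue.Queue`, and operators are threads rather than true processes, so the "no shared memory" property holds here only by convention.

```python
import queue
import threading

DONE = object()  # hypothetical sentinel marking the end of a stream

def operator(fn, inbox, outbox):
    """Start an operator: apply fn to each item arriving on inbox and
    push the result to outbox. Data arriving is the triggering event."""
    def run():
        while True:
            item = inbox.get()       # blocks until the channel has data
            if item is DONE:
                outbox.put(DONE)     # pass end-of-stream downstream
                return
            outbox.put(fn(item))
    thread = threading.Thread(target=run)
    thread.start()
    return thread

def run_pipeline(source, stages):
    """Wire source -> stage 1 -> ... -> stage n -> sink and collect output."""
    first = queue.Queue()
    channel = first
    threads = []
    for fn in stages:
        nxt = queue.Queue()          # uni-directional channel to next stage
        threads.append(operator(fn, channel, nxt))
        channel = nxt
    for item in source:              # the data source feeds the first channel
        first.put(item)
    first.put(DONE)
    results = []                     # the data sink drains the last channel
    while True:
        item = channel.get()
        if item is DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results

print(run_pipeline(range(5), [lambda x: x + 1, lambda x: x * x]))
# prints [1, 4, 9, 16, 25]
```

Each stage runs whenever its input channel has something in it, and the only coordination is the blocking `get` on the channel; the sketch contains no explicit locks.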
This is a very different view of computation from the shared-memory multithreading view, be it
object-oriented or procedural. The dataflow model requires operators to be processes, so there is no
shared memory between operators. This means operators are highly parallelizable using all the processing
capability the hardware has. The execution triggering events are all the synchronization that a
computation needs: no locks, no semaphores, no monitors, no programmer confusion. The dataflow approach
may seem strange to programmers brought up on shared-memory multithreading, but once you “get
it”, dataflow makes for easier, faster, and less error prone programming. Moreover the result is an
application that can easily and naturally handle as much parallelism as the problem can handle. The more
processors you throw at the execution, the quicker it finishes. Up to the theoretical maximum that is
possible, of course.
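One common way to put a number on that theoretical maximum is Amdahl's law. The article does not name it, so treat the following as an assumption about what the ceiling looks like: if a fraction p of the work can run in parallel, n processors give a speedup of 1 / ((1 - p) + p / n), which can never exceed 1 / (1 - p).

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

# Illustrative figures only: 90% of the work is assumed parallelisable.
print(round(amdahl_speedup(0.9, 10), 2))    # prints 5.26
print(round(amdahl_speedup(0.9, 1000), 2))  # prints 9.91
```

With 90% of the work parallel, the ceiling is a factor of ten however many processors are added, which is why the structure of the problem matters as much as the framework.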
So whilst dataflow is not a “silver bullet”, it is about as close as we are likely to get as
far as implementing algorithms is concerned, at least in the short and medium terms.
At the European Data Integration Summit 2011 (EDIS 2011) event held in the London Bridge area 2011-02-02
– see this website
– Pervasive Software announced version 5 of Pervasive DataRush.
We are currently assessing this new version, which is a relatively radical relabelling and restructuring
of the framework compared to version 4. Initial impressions are that the changes are favourable.
Certainly the labels are more consistent with, and less idiosyncratic compared to, the underlying dataflow
model. At the time of writing, none of our sample applications run using the new API. However, and I think
it bears repetition, the new API appears to be better labelled than the old. At this stage, if I have an
adverse criticism, it is that there needs to have been some more technical authoring on the documentation
Prior to the EDIS 2011 event, we had a briefing session with Jim Falgout (Chief Technologist, Pervasive
DataRush) and Ray Newmark (Director of Sales and Marketing, Pervasive DataRush). We had previously talked
with Jim early in 2008 and then again at SuperComputing 2008 (SC08) – and wrote articles about
Pervasive DataRush at that time:
Is Dataflow the New Black?
and Pervasive Software’s Datarush
– so it was something of a reunion to meet with him last week. Ray has been on board for about 8
months and seems to have provided the direction in terms of marketing that we felt was missing in 2008.
Furthermore having Pervasive DataRush with a definite strategic place in the Pervasive “end to
end” integration architecture has given Pervasive DataRush a definite role in something bigger.
Pervasive DataRush is, of course, a separate product, and can be used independently of the rest of
Pervasive's offerings, but without a place in Pervasive's offerings Pervasive DataRush was a little out in
Overall we think that Pervasive DataRush has a very rosy future, something we perhaps couldn't have said 2
years ago. It is a good product, with a clear role in an overarching architecture, as well as an
The Actor Model, originally presented sometime in 1973, has been getting a lot of press recently,
particularly as it is the model of concurrency and parallelism in Scala. But it is noticeable that
dataflow frameworks are appearing for Scala. Indeed within the Groovy milieu, there is a framework that
provides not only actors, but also dataflow – and indeed CSP (communicating sequential processes)
– GPars, cf. here. So the FOSS community are beginning to
put out dataflow frameworks for JVM-based systems. This strongly validates Pervasive DataRush as the
direction of future computation in the increasingly multicore world.
Fri, 04 Feb 2011
by Russel Winder
2011-02-02 was the day of the European Data Integration Summit 2011 (EDIS 2011) event, a one-day, four
track, marketing conference held in sight of Tower Bridge London, organized by and for Pervasive
Software Inc., its services and its products – see
this website. The day prior, Pervasive
organized a number of analyst briefings. In one of them, we talked with John Farr (President and CEO of
Pervasive) and Mike Hoskins (CTO of Pervasive) about the company, its products, its history, and its
future direction. The following stems from those discussions, with some fill-in from the conference
15 years ago Pervasive was principally a database provider, operating successfully in the not-DB2,
not-Oracle database space – Pervasive was born out of a name change of BTrieve Technologies Inc.
(cf. this Wikipedia page for more details on the
history). Over the last ten years, Pervasive has moved increasingly, and profitably, via various
acquisitions, into the “integration” space – whilst at the same time maintaining its
successful database business. Many of Pervasive's competitors have come and gone, usually by being bought
by one of the “biggies”. Pervasive has continued as an independent player, continuously
reporting profits, indeed growth – generally well above inflation. A good investment.
Pervasive is not though resting on its laurels. Far from it. The company invests 25% of profits back into
R&D, and that is mostly R rather than D. Pervasive is investing more heavily in R&D than might be
thought of as normal for a company such as this, because it wishes to be at the forefront of innovation.
Innovation at the moment is generally seen as making the Cloud work and be relevant. Pervasive is at the
heart of this, which is somewhat essential for a company which is more and more emphasizing its
“integration” business. In the Cloud, we have:
PaaS (platform as a service), but Pervasive is not really in this business, they are leaving it to
people like Amazon, Eucalyptus, etc.
SaaS (software as a service), but Pervasive is not really in this business, this is for the likes of
Google and Microsoft.
Pervasive is a "data company": their interest is in providing infrastructure for customers to manage their
data. Pervasive's integration products allow people to connect their various sources of data in whatever
way they wish. This means software can be moved to where the data is or, as is increasingly common in
Cloud-based approaches, the data can be moved to the program. Pervasive are looking to be the market
leaders in DaaS (data as a service).
Pervasive clearly have a strong vision of how to provide innovative Cloud-based frameworks to their
customers now and in the future. Which is good, but not really anything to do with parallelism, multicore
and cluster computing. So what is the interest for Concertant?
Part of Pervasive's strategy is to use internal “startups” as a cornerstone of its R&D
policy. A group gets “spun off”, albeit actually internally, to work on something. The two
currently running are Pervasive Data Solutions and Pervasive DataRush. It is Pervasive DataRush that
really piques the interest of Concertant. Pervasive DataRush is a dataflow framework, a software
architecture that is neither new nor currently seen as mainstream. The process and message passing basis
of dataflow is what makes Pervasive DataRush interesting, able to harness multicore and cluster
parallelism, and the reason it will be successful.
In fact Pervasive DataRush has been going for a while – the release of version 5 was a big
announcement made at EDIS 2011. Moreover we have interviewed Jim Falgout (Chief Technologist, Pervasive
DataRush) previously, early in 2008 via telephone conference, and then in person at SuperComputing 2008
(SC08) – see Is Dataflow the New Black?
and Pervasive Software’s Datarush.
We will address our perception of the technical progress of Pervasive DataRush in another article. For
the moment, it is the strategic importance of this product in Pervasive's portfolio that is interesting.
Pervasive has got the multicore and clustering bull by the horns. In version 11 of its database offering,
Pervasive SQL, there is support for multicore processors as well as 64-bit processing and IPv6. Now it
has placed Pervasive DataRush as its data analysis, analytics, data mining offering. Not a part of the
data movement around the Cloud, but core to its integrated offering. This integrated offering is marketed
under the label Pervasive DataCloud which is not PaaS and not SaaS, it is definitely a data-oriented
framework that sits over PaaS, employing SaaS.
Pervasive is tiny compared to the IBMs, the Oracles, the HPs, the EDSs of this world, but it looks as
though that is exactly why it is remaining a very successful company. It is providing highly integrated,
low-ceremony, low-overhead solutions – something required by SMEs, which cannot be provided by the
Whilst Pervasive remains profitable and innovative, I truly hope it remains independent and does not
become the target of acquisition.
Mon, 25 Jan 2010
by Peter Dzwig
Probably about 25 years ago I wrote a report for a client who was thinking of opening up shop in
Russia. Part of the core of that report was that not every society has the same (market) traditions and
culture as those in the West. That may or may not seem obvious. The client wanted to enter the Russian (at
that time recently-Soviet) market with an American-style market proposition. My view was – and still
remains – that the market wouldn't necessarily adopt the western model rapidly, if at all. We are
all aware that the Russian "model of capitalism" is very different from the western one, let alone the US
version. Still more so the current Chinese version of capitalism "with a Socialist face": we have no idea
when, if ever, it will become like the western version. In fact all the evidence is that the two are
So it perhaps should not surprise us that the Chinese model of the Internet is also so different from
ours. China is a 3,000 year-old civilisation with a long tradition of insularity, of which the current
China is merely the present manifestation. For a long part of that time it has looked down on the outside
world and in effect closed its doors against the outside world. Therefore I was fascinated by an article
in the FT of 20th Jan about the so-called "Chinese Firewall". It didn't
come as a great surprise that Google and the Chinese authorities should have had a run-in; one was surely
inevitable if not necessarily imminent. There is a fundamental tension between the Chinese way of doing
things and the western way of doing things.
Here I am specific in my use of the term "western" as countries such as Japan and Korea, as well as a
number of South-East Asian countries, sit along a spectrum between the western model of the Internet and
the Chinese. The fact that China has a much bigger population than any other country and is "opening up",
is seen by many in the West as a huge potential market for their companies. To the Chinese it probably
appears as a wholly different proposition. It is not clear that the perspective that the
Chinese adopt is anything like the rest of the world's. After all their population is about three times
that of the EU and a still larger multiple of that of the US; they have a huge market for their own technology and do not
have to be beholden to the outside world. So perhaps we are wrong to be surprised at western perceptions of
China's attitude towards such as Google (and they aren't the only ones). I am not referring to alleged
attempts to hack dissidents' Googlemail accounts, but to the overall marked divergence in attitudes
between China and Google going back to the point at which Google entered China.
More immediately salient is that, as the FT article shows, the Chinese usage profile of the Internet is
different from that in the West. Certainly there are areas that are very different because of the
sensitivities of the Chinese government to social networking sites; but what struck me were the figures from
McKinsey quoted in the FT article about the general profile. The next part is broadly a summary of those,
for which I take no credit. A Chinese person is likely to make (all figures I am going to give are rough)
2/3rds as much usage again of the net for email and searching for information as a European counterpart; a
Chinese person makes only one-eighth (!) of the use of the net for work-related purposes as a European
counterpart; but 60% more use for gaming; for chatting/instant messaging usage is 235% of European usage;
and almost 80% greater for downloading films or music.
A proportion of the diversity of these figures might be laid at the door of lack of social networking
sites (email traffic) and Chinese attitudes to intellectual property, in particular copyright
(downloading), but not exclusively. The Chinese usage model for the net appears to be one of a gigantic
playground. For most net-using westerners with a tradition of research, particularly in Europe, the net is
much more of a space within which to find out information as well as to communicate with friends. That is
not to say that westerners don't game, download games or chat; but in China the figures are much greater.
I don't actually want to comment too much further on the Chinese model of the Internet except to say that
it offers an alternative profile of usage to the typical one that we have adopted in the west. The Internet has
become what it has become in the rest of the world because of the model that the rest of the world has
adopted, driven by a US-centred model. Is that the only model? Should we not at least consider the
alternative ways of using the Internet, and what that might imply? If we look at other usage models then
perhaps we could learn for the future and indeed could plan our own networks better. I would be fascinated
to see what the comparable figures look like for other emerging economies as they evolve over time. India
and Brazil come to mind here. It would also be interesting to see how those evolve over time, what the
regional evolution is like and also how it has developed in the past.
In France in particular, there is a debate going on at present as to how to deliver much greater bandwidth
than they have at present, including to rural populations – and this is in a country where
substantially higher rates than the UK has are the norm. The Digital Britain plan to deliver 2 Mb/s (max)
to the door would be woefully inadequate if we were to look at a model in which there were a lot more
gaming, chatting and above all downloading of movies/music and on-line TV. There are two sides to this
problem, one is typified by the usage model as above. The other is – and this is why this appears
here – that a network's characteristics in terms of bandwidth needs are set by the technologies that
are coupled to it. Processor speeds are growing and will go on growing. Multicore means that that is a
practical reality, that after all was its rationale. That will increase demand on the ability to download
and upload – and not just for the user, but for industry as well. Thus network speeds are a factor
in economic performance; lack of delivery will ultimately be a barrier to economic competitiveness.
The Digital Britain plan is woefully inadequate, both in respect of technology (bandwidth) and in terms
of delivery targets. It is also not going to address the UK's need to be able to compete. Even were we to
adopt a less business-oriented and more "Chinese" usage model, where more than raw speed is the issue
(quality of line, latency and so on are more important there), we would fall short because most of the network in the UK
is inadequate to deliver to most people. If the target does not change it will do little to reduce
"notspots" in relation to average speeds. In fact the concern is that they may even expand. To put some
figures to this for a moment, if you look at SpeedTest (http://www.speedtest.net/SpeedTest),
for example, the UK (by their sampled speeds) ranks 41st globally by download speed and 64th by upload
speed. For the "global leader in technologies" that the government aspires to be, these are not good
figures. Yesterday (22nd January) BT announced that it will deliver 40 Mb/s with its
service to a limited number of subscribers rolling out this year and "reaching"
4 million by next year. The fact that up-to-date technology can already deliver well over 50 Mb/s in
real usage perhaps says more about the ailing state of the UK infrastructure than anything else.
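Line rates are quoted in megabits per second while file sizes are usually quoted in megabytes, a factor of eight that is easy to lose. As a rough sketch of what the rates discussed above mean in practice, the following takes the 2 and 40 figures as megabits per second and applies them to a hypothetical 700 MB film. The file size is an assumption chosen only for illustration, and protocol overhead and contention are ignored.

```python
def download_seconds(size_megabytes, rate_megabits_per_s):
    """Idealised transfer time: convert megabytes to megabits (x8),
    then divide by the line rate. Ignores overhead and contention."""
    if rate_megabits_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_megabytes * 8 / rate_megabits_per_s

FILM_MB = 700  # hypothetical file size, not a figure from the article

for rate in (2, 40):
    seconds = download_seconds(FILM_MB, rate)
    print(f"{rate} Mb/s: {seconds:.0f} s (about {seconds / 60:.0f} min)")
# prints:
# 2 Mb/s: 2800 s (about 47 min)
# 40 Mb/s: 140 s (about 2 min)
```

On these assumptions the same download drops from three-quarters of an hour to a couple of minutes, which is the gap the usage-model argument turns on.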
Tue, 15 Dec 2009
by Peter Dzwig
Predictably the fuss about the demise of Larrabee lingers on.
Let's start at the beginning. Intel has said that the processor will have a continued life as an SDK of some
form, and that it will be available to various types who have expressed a desire to be able to use
it. Importantly there is Intel's offer for the HPC community to be given access to it.
The reality has to be that Intel could no longer see the market opportunity for the technology. Surely at
one time they could do so: it needed that, whatever other possibly apocryphal stories are told about its
origins, in order for the management to give it some hope of seeing the commercial light of day.
Had the design run out of engineering steam? Possibly. Larrabee was expected to have seen light of day in
the second half of next year. By then the other players in the market will have pushed further ahead. The
more so since all the hype will have focused their minds on doing so. By which time perhaps Larrabee would
have looked architecturally interesting but behind the curve. So it is far from impossible that the high-ups
on the engineering side just decided that they weren't going to be able to squeeze enough out of it
technologically. From an engineering perspective they will have learnt many lessons. In fact, we can be
almost certain of it. There are even a few hints of Larrabee in the SCC; not many, but some. Larrabee
should really be seen as a test bed for ideas: about graphics processors, about memory disposition, about
interconnects, programmability and much else besides. That's its real long term value for Intel.
The vacillations over the last couple of years or so, during which we saw specs change and configurations
develop and then disappear, have contributed to the market's decreased desire for Larrabee – though
most of this downward pressure has come from people who didn't really know the product. Its slot has
meantime largely been filled by NVIDIA et al. and the goalposts have really moved. For Intel this meant that
had they ever got to market they would have come in a, perhaps distant, third. That doesn't make business
sense for them.
Looked at from a purely commercial perspective then, the decision to remove Larrabee from the likely product
list seems entirely reasonable and perhaps inevitable.
In conclusion this was the right decision to take. The technology has gone as far as it can or at least as
far as Intel wanted to take it. There has been substantial value added to Intel's business by the teams that
contributed the engineering skills and intellectual property, and finally it will be made available to
internal and external developers and the HPC community as a development platform.
Fri, 11 Dec 2009
by Russel Winder
Groovy is more and more rapidly gaining traction in the Java
community. The JVM is becoming the standard hardware independent platform for almost all new applications
-- especially those that are Web-oriented. Polyglot programming is rapidly becoming the norm: systems are
developed in some mix of Java, Scala, Groovy, Clojure, Jython, and JRuby. Until recently Jython and JRuby
were being directly supported by Sun. However they have been ejected from Sun's corona as part of the
purchase of Sun by Oracle.
Groovy has, since its inception in 2003, been developed by the open source community as
a Codehaus project. Inspired by Ruby on Rails as a web application
development platform, the Grails project was born, again driven by the
open source community. But there is commercial development interest. The company G2One that was formed
by the Groovy and Grails Project Leads, was bought some time back by SpringSource (who own Spring). They
then put quite significant resources into Groovy and Grails development and most especially into Eclipse
support for Groovy. SpringSource's interest in it was motivated by the fact that Grails was beating "Ruby
on Rails" in the commercial arena; that it uses Spring (and Hibernate) under the hood; and that Grails is
the easiest way of developing Spring-based applications that there is.
SpringSource has in its turn recently been bought by VMWare. So whilst Groovy and Grails are still owned by
the community, VMWare is now putting resources into development via SpringSource, but guided by VMWare's
commercial strategies. This means virtual machines and clouds.
Graeme Rocher (Grails Project Lead) yesterday gave a presentation at Groovy & Grails eXchange 2009 in
which he outlined what is coming in Grails v1.2, to be released within two weeks. Using virtual machines
for deployment and getting into The Cloud were clear messages. Currently The Cloud more or less means
Amazon (which may not be acceptable for many businesses) but there was also the “private cloud”
idea: businesses having internal clouds and using virtual machine technology to make application deployment
easier and isolated from the outside world. VMWare's hand in this message is rather clear, even though
the presentation was branded SpringSource!
Grails, and on the back of it Groovy, is now being made ready for prime time: Grails version
1.2 and Groovy version 1.7.0 are being rolled out before the end of the year providing the base for next
year's new crop of Web applications.
Thu, 10 Dec 2009
by Peter Dzwig
So Intel have canned Larrabee and gone to 48-core clouds on chips. Is it really that simple?
The short answer is “yes” and “no”.
Intel have been looking at a variety of architectures over a period of time. This is an obvious step: if
they are going to change their core counts beyond the (relatively) few cores that they have on production
chips at present then they need to understand what the issues are going to be and what design strategies
are useful. The Terascale chip (aka Polaris) was a very different beast from the recently announced
48-core processor, being 80 VLIW-core based so looking nothing like an x86 configuration and described by
some who knew as barely programmable. Nonetheless there were apparently a lot of lessons. Intel describe
the latter as “primarily a circuit experiment” and SCC as “a circuit and software
The current processor is (apparently) readily programmable, being “IA-compatible” and so can
run off-the-shelf apps. It uses message-passing shared virtual memory and actors. It is also made on
45 nm technology. However if the announcement is to be taken at face value then this too will not
make it to production.
It's not that surprising that Intel appears to be slowing down work on Larrabee, given its on/off history
and changes of configuration, rumoured or otherwise. It is very unlikely though that Intel would want to
lose the experience and technological benefit gained from developing it. That's not how engineering
progresses. My guess is that some of the developments will re-appear in some shape or form in future
These are all steps along the road.
If I were being asked what will make it to production my guess is that it will not look a great deal
like any of these. A heterogeneous hybrid with several different types of cores, some targeted at specific
problems might be closer. Whatever does appear will contain lessons that have come from all of these
processors, and from all Intel's other multicore processors – and Pentiums too. Probably the 48-core
system is somewhat closer to what is likely to be reality than Terascale ever really was. Given the
predicted growth in numbers of cores on a chip then for reasons of engineering and programmability it
would appear that distributed memory and some kind of network is the way to go. IA will most probably be
implemented in some shape or form, if only for backwards compatibility.
It is also interesting to speculate how Intel will address its future markets. Traditionally the embedded
market has seen different architectures. However the commercial challenge of widely differing novel
architectures will be great. Whether or not this leads to some design rationalisation is still to be
seen. Would multiple, possibly divergent, processor lines make commercial sense in the nearer term?
Mon, 09 Nov 2009
by Peter Dzwig
ARM are always a company worth noting, if only because they dominate a market sector (processors for
mobile devices) even more completely than Intel dominate the PC market. According to figures currently
being bandied about, ARM hold in excess of 95% of the current mobile market. According to some that goes
as high as 98%. What is perhaps an even more important measure of that dominance is that most handsets
have 2–3 ARM processors. That is real market dominance.
Companies such as NVIDIA, Qualcomm and others are using ARM's processors to move the market for netbooks
and notebooks ahead. This is an area in which Intel sees itself and its Atom as having a natural
dominance. That is clearly not the way that ARM and its collaborators anticipate that things will turn
A few days ago, ARM had its annual technology meeting in “The Valley” around which clustered a
number of announcements. Perhaps the most interesting for us was the link between ARM and FPGA
manufacturer Xilinx. The collaboration owes a lot to the finalisation, or near finalisation, of
the AMBA bus specification. Xilinx
can now see – and are keen to tell the world – how the combination of FPGAs, ARM's Cortex and
AMBA fit together and how AMBA may become a solution for on-board FPGA communications. AMBA is not an
While AMBA is almost thirteen years old it has now reached a level of maturity where it is seen as a
product capable of delivering pretty much everything that an embedded designer is looking for. In
that respect at least it is regarded by many as the de facto 32-bit embedded standard.
Wed, 28 Oct 2009
by Peter Dzwig
Tilera first made a name for itself a couple of years ago with the Tile64, which
we wrote about at the time. Now Tilera have announced the Gx
series of "tile"-based processors. A development of the earlier Tile64 and TilePro chips, the
Tile Gx can have from 16 to 100 tile processors on the chip. These form a homogeneous array of
64-bit VLIW processors with a 64-bit instruction bundle, interconnected by a mesh network. The pipeline
is three-deep and can handle up to three instructions per cycle. The whole is programmable using C or
C++ via the GCC compiler, and can run Linux. Tilera have an Eclipse-based IDE.
The PR claims the Gx to be “the world's first 100-core processor” and to “offer the highest
performance of any microprocessor yet announced by a factor of four”, which is a little over-hyped.
However the Tile Gx is likely to be an important chip in its target sectors. These are essentially the
embedded markets covering the gamut of high performance applications such as advanced networking, wireless
infrastructure and digital video. Those won't surprise any Tilera-watcher. However the addition of Cloud
computing as an applications area shows that they are starting to move away from their traditional markets
to look more broadly. All that the PR says is that suitable applications may lie in areas such as LAMP
servers, data caching and databases. Whether this means that anyone is already running a corporate
database on a Tile system is not made explicit. There are, though, applications in databases and data
processing which are well suited to multiple data pipelines.
The Gx does appear to offer some real performance leaps and some very interesting architectural
novelties. The 100-core angle is really just what it says. We would be hard pressed to think of another
processor with 100 cores, 80 yes (Intel Terascale), 90 yes (Cisco), over 100 yes (many of them, some even
saw the light of commercial day); but exactly 100? We can't think of one for which you could produce a
product spec-sheet! The Tile Gx series comprises Gx16 (16 cores, 4x4), Gx36 (36 cores, 6x6), Gx64 (64
cores, 8x8), and the Gx100 which unsurprisingly has 100 cores in a 10x10 grid. The performance claims will
need some practical justification, but that will have to wait till silicon is available.
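To give a feel for what the grid sizes above mean for a mesh interconnect, here is a minimal Python sketch. The grid dimensions come from the text; the assumption that a message crosses one link per hop along a shortest path, and all the function names, are ours for illustration and are not taken from Tilera's documentation.

```python
# Grid side lengths as listed in the post (Gx16 = 4x4 ... Gx100 = 10x10).
GX_GRIDS = {"Gx16": 4, "Gx36": 6, "Gx64": 8, "Gx100": 10}


def hops(src, dst):
    """Shortest-path hop count between two (row, col) tiles on a 2-D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def worst_case_hops(side):
    """Hop count between opposite corners of a side x side mesh."""
    return hops((0, 0), (side - 1, side - 1))


def mean_hops(side):
    """Average hop count over all ordered pairs of tiles (including self)."""
    tiles = [(r, c) for r in range(side) for c in range(side)]
    total = sum(hops(a, b) for a in tiles for b in tiles)
    return total / (len(tiles) ** 2)


def summary():
    """Map each part name to (core count, worst-case hops, mean hops)."""
    return {
        name: (side * side, worst_case_hops(side), mean_hops(side))
        for name, side in GX_GRIDS.items()
    }


if __name__ == "__main__":
    for name, (cores, worst, mean) in summary().items():
        print(f"{name}: {cores} cores, worst case {worst} hops, mean {mean:.2f} hops")
```

Under these assumptions the corner-to-corner distance grows from 6 hops on a 4x4 grid to 18 hops on a 10x10 grid, which is one reason why the placement of communicating tasks matters more as the tile count rises.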
Although the chips arrange the tile processors in a regular 2-dimensional array, problems do not always
fit such a structure. The Gx series has routing capabilities to get round this: the programmer can build
appropriate networks of processors with the cores and the interconnect and do that without compromising
performance. If you look at most applications where you are into proper parallel processing, i.e. mapping
directly between cores and algorithm components, then you end up with irregular networks. This is why
Tilera's local memory structure (32K L1i, 32K L1d, 256K L2 per tile) is appropriate in a very general
Sadly you can't expect to see the Gx in your friendly local distributor's catalogue soon. The Gx36 is
slated for introduction around Q4 2010 – which the experienced among you may interpret as you see
fit. We will though be writing more about the Gx series soon.
Thu, 08 Oct 2009
by Russel Winder
These days when the average programmer thinks of handling concurrency and parallelism, they usually think
"threads". This then leads to horrible synchronization issues and worrying about locks, monitors and
semaphores. And in the end the programs generally have non-deterministic errors. At the heart of the
problem is shared memory.
Many, many years ago, models of concurrency were proposed -- cf. Actor Model, CSP (Communicating
Sequential Processes) -- and these are seeing a huge resurgence of interest with the pandemic parallelism
now available on all computers due to the Multicore Revolution. Erlang's parallelism is based on the
Actor Model. Scala uses the Actor Model as its mechanism for dealing with concurrency and its subset
parallelism. Clojure takes a related route with agents and software transactional memory. CSP is springing up with JCSP, Python CSP, etc.
Java though is still stumbling along trying to harness parallelism with threads. Until now.
The Groovy community has been discussing what to do about harnessing parallelism for a year or two now.
Last year Václav Pech acted and started the GParallelizer project. This was inspired by the work in Scala
on the Actor Model and focused on using Groovy as a base on which to write a domain specific language
(DSL) to be a coordination language managing parallelism. Till a couple of months ago this had been a
one-developer project. Now though it has become a serious, and probably a strategic, multi-developer project.
The project has been rebranded GPars and is now a Codehaus project. Václav Pech is leading an effort -- which
includes your current author -- that will undoubtedly see Groovy used as a
way of specifying the concurrency and parallelism architecture for many a Java system. Parallelism on the
JVM just got very Groovy.
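As a rough illustration of the message-passing style discussed in this post, here is a toy actor written in Python using only the standard library. It is not the GPars API, nor Scala's or Erlang's; the class and function names are hypothetical. The point is only that the counter's state is touched by a single thread, so the senders need no locks of their own.

```python
import queue
import threading


class Actor:
    """Toy actor: private state, one mailbox, one thread draining it."""

    _STOP = object()  # sentinel message that ends the actor's loop

    def __init__(self, handler, state):
        self._handler = handler          # function (state, message) -> new state
        self._state = state              # only ever touched by the actor's own thread
        self._mailbox = queue.Queue()    # thread-safe FIFO of incoming messages
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        """Post a message; the caller never sees or shares the actor's state."""
        self._mailbox.put(message)

    def stop(self):
        """Process everything already queued, then return the final state."""
        self._mailbox.put(self._STOP)
        self._thread.join()
        return self._state

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self._state = self._handler(self._state, message)


def count_in_parallel(n_senders=4, per_sender=1000):
    """Several threads send increments to one counting actor."""
    counter = Actor(lambda total, n: total + n, 0)

    def sender():
        for _ in range(per_sender):
            counter.send(1)

    senders = [threading.Thread(target=sender) for _ in range(n_senders)]
    for t in senders:
        t.start()
    for t in senders:
        t.join()
    return counter.stop()


if __name__ == "__main__":
    print(count_in_parallel())
```

The design choice worth noting is that synchronization lives entirely inside the mailbox; application code only sends messages.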
Wed, 07 Oct 2009
by Russel Winder
Over the last few days, there have been various announcements by Intel, AMD, NVIDIA and all the usual
suspects, of new or revamped tools to enable programmers to harness the multicore CPUs and GPUs that are
now effectively mainstream hardware. There is clearly a yawning chasm between today's hardware systems and
the use made of these by today's software systems. A gap that is likely to get bigger before it gets
smaller. Hence the push by the hardware manufacturers to ensure there are good tools available. It is
purely enlightened self-interest.
Why comment? It seems that unless you are using Visual Studio you do not get access to these tools. Now
whilst Visual Studio is a very important “player in the game”, an increasing number of
developers use Linux, Solaris, Mac OS X, FreeBSD, etc. as their development platform. Moreover, Linux is
the majority player in the “operating system for HPC” stakes. It seems a poor strategy
therefore to treat these platforms as third-class citizens, or as in many cases simply ignore them –
particularly bearing in mind that today's HPC application is tomorrow's mainstream application.
Then of course there are articles talking about how these tool manufacturers are “in talks
with” you-know-who about operating system support and tool support, and your mind is drawn to
“conspiracy theories” . . . is this just another aspect of the attack on alternative operating
systems by a monopolist?
Wed, 16 Sep 2009
by Peter Dzwig
At present there are some very good tools out there for supporting parallel program development, from
language extensions and compilers to software architecture and design tools. Yet there are none out there
which actually deliver the holy grail of parallel programming: to take an arbitrary piece of sequential
code and transmute it, in the modern equivalent of the Alchemists' Dream, into code capable of running on
any platform and delivering anywhere near ideal performance.
In fact this particular dream is highly unlikely to happen because, in general, parallel code, and in
particular parallel algorithms and hardware, are substantially different in form from their sequential
Parallel programming, in many diverse forms, has been around as a commercial reality since at least the
1970s when ICL (now part of Fujitsu) launched the DAP (Distributed Array Processor) as an attached
processor for its mainframes; you can push that date back further if you include academic exercises and
multiple CPU systems. Yet to date the dream goal hasn't been reached. Technologies as diverse as the DAP,
SuperNode and its relatives from Meiko and Parsytec, from supercomputers to modern multicores, sought to
solve the problem, or at least address it, through the deployment of specialised compilers, extensions to
existing languages, or complete new languages. This worked adequately at the time because the user base
for each system was limited in one way or another, and many of the would-be users were in research
facilities. This meant that they had the time to work out the problems, and modify their code
The adoption of multicores as the way to deliver cost-effective performance (by a wide variety of metrics)
by the preponderance of manufacturers has meant that market penetration has increased for processors
having two, four or eight cores. The demographic of the user base has broadened dramatically as a
result. This means that users are no longer prepared to deal with arcana in order to get promised
performance; they want it delivered simply. They don't want to see any change in the way that they program
and retraining should be minimal if any is needed at all. Up to now the user (except for the specialist)
has been shielded from the details of a processor by the operating system and other layers.
As the number of cores on a chip grows – and it will do – the problem of how to realise the
performance on offer will become increasingly complex. The industry cannot expect the user or
programmer to learn specific languages or extensions to languages in order to be able to program
company X's laptops. It will get more complex still because it will be possible to create highly
customised specialist installations. While potentially important where there are particular requirements
for high performance, these will reduce the potential for code portability.
Then there are all those different architectures...
What the user will want is to program/develop their program/application once and once only, thereby
preserving software investment. Whereas nowadays a modern application can still run on a 1.2 GHz Pentium
(albeit slowly), the question of such backward code deployment will become more complex and eventually
How are we to address this? The simple answer is that we don't know at present. Yes, we could point to a
few technologies around at present; but perhaps it is better to ask what the user is likely to want. If
portability (i.e. maintaining the value of software investments) is to be the principal criterion, then
surely we need to hide hardware changes from the user. If we assume the existence of some sort of
operating system level then we are presumably interposing an additional layer between user and the
operating system. One would anticipate that this would detract from raw performance, which, for those who
demand raw performance, would be detrimental.
However while this might be important for certain user communities we must accept that the vast majority
of users, and indeed of developers too, don't care – provided that they don't lose
“a lot” of performance. This is particularly important as performance improves. What we
should be looking at is the proportion that is lost. Provided that this can be limited to a low proportion
of the overall figure the vast majority probably don't care. Indeed there is some suggestion that such
overheads may reduce over time if history is a guide.
What tools we might run over a large core count system, we don't yet know. It may well be that the tools
that we will need don't yet exist. It would be an extremely worthwhile program of research for people to
step back and take a long hard look at what we really need. The present assumption, from almost all sectors
of our industry, is that they will be like what we have already. What justifies that assumption? If you
look at that question in some depth – and that is a part of Concertant's activities – then the
evidence that we know how to deal with even 64-core systems (due around 2015) is fairly scarce. There is
certainly a paucity of consensus. It takes a good few years to get from the research lab to the market,
so work had better get underway soon.
The tools industry is quite probably set to change, conceivably beyond all recognition.
Tue, 15 Sep 2009
On June 29th Concertant organized a workshop on behalf of the UK's Grid Computing Now! Knowledge Transfer
Network (GCN!-KTN) to investigate the consequences for the UK of the multicore revolution.
Around 40 invitees from end-users, industry (including the software and hardware industry), government,
other KTNs and academia attended the workshop. The report contains a set of recommendations to improve the
UK's competitive position in the global MCP market.
The final report is here.
Fri, 11 Sep 2009
After a campaign including a petition on the Prime Minister's website, Gordon Brown has finally apologised
for the “shabby” way in which Alan Turing was persecuted in the 1950s for his homosexuality in
a series of events which led to his committing suicide and which cost the UK its most influential computer
scientist. The apology has been the goal of a petition that has been running in recent weeks.
The text of the statement from the Number 10 website
is as follows:
2009 has been a year of deep reflection – a chance for Britain, as a nation, to commemorate the
profound debts we owe to those who came before. A unique combination of anniversaries and events have
stirred in us that sense of pride and gratitude which characterise the British experience. Earlier this year
I stood with Presidents Sarkozy and Obama to honour the service and the sacrifice of the heroes who stormed
the beaches of Normandy 65 years ago. And just last week, we marked the 70 years which have passed since the
British government declared its willingness to take up arms against Fascism and declared the outbreak of
World War Two. So I am both pleased and proud that, thanks to a coalition of computer scientists, historians
and LGBT activists, we have this year a chance to mark and celebrate another contribution to Britain’s fight
against the darkness of dictatorship; that of code-breaker Alan Turing.
Turing was a quite brilliant mathematician, most famous for his work on breaking the German Enigma
codes. It is no exaggeration to say that, without his outstanding contribution, the history of World War
Two could well have been very different. He truly was one of those individuals we can point to whose
unique contribution helped to turn the tide of war. The debt of gratitude he is owed makes it all the
more horrifying, therefore, that he was treated so inhumanely. In 1952, he was convicted of “gross
indecency” – in effect, tried for being gay. His sentence – and he was faced with the
miserable choice of this or prison – was chemical castration by a series of injections of female
hormones. He took his own life just two years later.
Thousands of people have come together to demand justice for Alan Turing and recognition of the
appalling way he was treated. While Turing was dealt with under the law of the time and we can’t put the
clock back, his treatment was of course utterly unfair and I am pleased to have the chance to say how
deeply sorry I and we all are for what happened to him. Alan and the many thousands of other gay men who
were convicted as he was convicted under homophobic laws were treated terribly. Over the years millions
more lived in fear of conviction.
I am proud that those days are gone and that in the last 12 years this government has done so much to
make life fairer and more equal for our LGBT community. This recognition of Alan’s status as one of
Britain’s most famous victims of homophobia is another step towards equality and long overdue.
But even more than that, Alan deserves recognition for his contribution to humankind. For those of us
born after 1945, into a Europe which is united, democratic and at peace, it is hard to imagine that our
continent was once the theatre of mankind’s darkest hour. It is difficult to believe that in living
memory, people could become so consumed by hate – by anti-Semitism, by homophobia, by xenophobia
and other murderous prejudices – that the gas chambers and crematoria became a piece of the
European landscape as surely as the galleries and universities and concert halls which had marked out
the European civilisation for hundreds of years. It is thanks to men and women who were totally
committed to fighting fascism, people like Alan Turing, that the horrors of the Holocaust and of total
war are part of Europe’s history and not Europe’s present.
So on behalf of the British government, and all those who live freely thanks to Alan’s work I am very
proud to say: we’re sorry, you deserved so much better.
Wed, 09 Sep 2009
by Peter Dzwig
The end of August saw the Hot Chips meeting at Stanford. There were announcements that we think deserve
comment, from IBM, Sun and AMD – which in turn have opened many questions.
The diversity of processor architecture among the big players continues and there is little sign of
consensus about which way the market will evolve.
IBM were talking about the Power 7 which will apparently be available in 4-, 6- and 8-core variants,
have 32MB of Level 3 cache and support 4 threads per core. Rumour has it that it will be among the fastest
processors available, though whether it will surpass Fujitsu's SPARC implementation remains to be
seen. Sun's Rainbow Falls (SPARC T3) processors have 16 cores, each with its own L2 cache, and of course,
being a Sun product, support threading, in this case up to 128 threads. AMD's Magny-Cours offering will
have 12 cores. In reality it's two 6-core Istanbul dies in the same package. Intel weren't very
conspicuous, giving more details of their 8-core, 24MB cache and 16-thread Nehalem EX, although they did
talk a little about their 32nm Westmere chip.
Viewed as a whole, the group shows that designers are now putting a considerable amount of effort into novel
communications architectures and communications speeds, and also in matching caching structures to achieve
the potential throughput in these chips. Many of these processors have faster core interconnects and
faster I/O enabling performance and data movement to be better balanced. For this generation, and the ones
beyond it, the ability to move data between cores is going to be crucial.
With the exception of Westmere, this collection is slated for release in the course of next year. So
2010/11 is expected to see the evolution into double-digit cores of many top-end servers, which is where
this group is mainly targeted.
The real question, “whether or not software will be able to use this power”, is the key to
the commercial success of these processors and others that follow in their wake. In the main, many
commercial server-based systems use one (fast) core or perhaps a pair of cores to deal with replicas of
the same process. However only a fraction of the potential performance is being reached in this
way. Proper parallelism is a way off yet in mainstream applications. For the hardware industry, however,
its throughput profile makes the high-end server market the obvious point at which to introduce these high
core count architectures.
Today in most people's terms quad-core is regarded as mainstream and fairly high performance, so what is
being proposed here is a big leap forward, even for high-end servers. The faster internal structure means
that the architectures are, as a whole, becoming more balanced and so opening up to faster data streams
both in I/O terms and among cores. However, the software industry and the peripherals industry haven't
caught up yet. These new processors are becoming really data-hungry and there aren't yet that many
applications around to take advantage of them.
Obvious industrial applications lie in the broadcast industry and in other media applications, including
of course the Internet. The question is how long before the mainstream user catches up and how will they
use the processors then.
Thu, 03 Sep 2009
by Peter Dzwig
As the summer vacations ended out came news that probably cheered the hearts of many an HPC programmer
– and possibly a few investors too – that Cray had succeeded in acquiring several of
SiCortex's assets including the PathScale EKO compiler suite. PathScale's suite provides 64-bit C support
as well as C++ and Fortran compilers for Linux-based environments.
In a move which may have surprised some, Cray will use some of those assets internally, but will also
partner with the open source world through a combination of existing PathScale engineers and NetSyncro.com
who will continue to develop the compiler, provide support for users, re-brand their efforts as
“PathScale”, and be supported by a “new PathScale” company. NetSyncro.com is an
open-source-oriented group of long-term UNIX and Linux developers with a wide range of experience. This
new structure will enable Cray to use some of the PathScale assets to develop its own IP, while permitting
existing users of PathScale on Cray to keep using their favourite tools. In addition, while important for
many standard “Cray-style” HPC applications, PathScale's suite is also used on platforms other
than Cray. The licensing for these is, according to the community site, being sorted; so it looks like
no-one will miss out.
Cray's future toolchain will continue to provide a diverse range of compilers for its boxes with PathScale
sitting alongside offerings from the Portland Group (PGI Server C/C++/Fortran compilers and tools for Linux) and
Cray's own CCE.
Interesting side comments have included suggestions that the new PathScale will direct research towards
Sat, 22 Aug 2009
by Peter Dzwig
Last year at – and for that matter after – SuperComputing, we wrote about the lack of
direction in the parallel software market. In subsequent commentaries, we talked about how, unless there
was agreed action across the market place, software could easily become dominated by one company. It looks
like Intel are grasping the bull by the horns.
In a recent bout of acquisitions since June (though doubtless the negotiations have been going on for much
longer), Intel has bought WindRiver, Cilk Arts and now RapidMind.
It's an interesting mixture: the dominant embedded software provider – at least as far as UNIX/Linux
systems are concerned – and two smaller, but very innovative, software companies that appear to
complement Intel's existing offerings rather well.
Cilk Arts was an MIT offshoot, targeting dynamic and highly asynchronous code. Its C (Cilk) and later C++
(Cilk++) directed offerings align well with Intel; in fact James Reinders said in a blog piece that
Intel's Threading Building Blocks (TBB) was inspired by the work of Cilk. So Cilk's Cilk++ offering
clearly fits at several levels.
RapidMind's technologies are widely admired throughout the industry and it is hardly surprising that Intel
will continue to market RapidMind's offering (though whether under a new label or not, and for how long,
remains to be seen). Its tools could fit well with Intel, depending on which way Intel is moving. But we
must presume that Intel wouldn't have acquired them were they not broadly heading down the same path.
WindRiver, at a price of some $884 million, is by far the largest of the three, but offers an interesting
synergy. It has been suggested that the embedded Linux expertise will sit well with the Larrabee
architecture, as the chip has to run its own operating system in order to be able to handle the
complexities of its architecture and integrate with the more mainstream device architectures currently
featured in most boxes.
WindRiver also brings considerable mobile expertise. The whole area of mobile devices is one that Intel
has targeted for a number of years and which is growing in importance both for the company and for the
market as a whole. The current continuing growth in markets for all categories of mobile device, as well
as the sheer number of new types of devices that people talk about, means that the area will likely
continue to support substantial growth over the next few years. Intel is already a major player there and
its strategies would clearly benefit through the acquisition.
So where does this leave us with respect to parallel systems? Well, it clearly enhances Intel's parallel
tools and probably presages a broader development of C++-related technologies to complement its TBB,
OpenMP and related offerings; and it provides, at the very least, the ability to expand its tool chain.
But what of the opposition? They haven't as yet replied in similar fashion. As we have said before, we
could get once again to a position in which one large or influential company sets the course of software
for years to come.
Part of the problem is the sheer diversity of parallel systems. They extend from asynchronous, highly
asymmetric embedded systems to highly symmetric ones and with processor complexities running from
relatively simple architectures to the extremely complex. The program development issue is then further
complicated by the sheer variety of algorithms involved.
Does this mean that there is no single solution to the issue? Possibly not. We have yet to find a single
language or development environment that can embrace all types of applications. That is something that
may not change. For the moment, though, Intel is forcing the pace.
Fri, 14 Aug 2009
Russel Winder gave a presentation
at UKUUG Summer 2009 conference entitled Shared-memory
Multithreading is the Wrong Way to do Parallelism, the slides can be
As well as emphasizing the move towards lightweight processes and message passing, cf. Erlang, Scala,
etc., the session raised the question of whether current operating systems would be up to the task of
managing systems with multiple processors, each of which had thousands of cores all using distributed
memory – single central memory architectures are untenable in the presence of very large numbers of
Wed, 12 Aug 2009
Concertant's Francis Wray has been appointed
Visiting Professor in the Faculty of Computing, Information
Systems and Mathematics at Kingston University. He has a
very strong background in parallel processing which has recently become highly topical due to the advent
of multicore processors, and is very relevant to the work of the Department.
Concertant has a long tradition of working with a wide variety of institutions. We can apply our unique
blend of expertise in areas as diverse as aerodynamic simulations and embedded systems to the benefit of
For further information please contact: firstname.lastname@example.org
Mon, 22 Jun 2009
by Russel Winder
There is much speculation lately around the future of the Rock processor, now that (as seems highly
likely) Sun will be absorbed into Oracle. Many of Sun's businesses will find a place in the Oracle
structure: the storage business, the server business, the OS business, the Java business all have ways of
being absorbed into a sensible Oracle strategy for growth. Sun's processor development business on the
other hand seems out of place in this context.
Analyst and journalist speculation is that the most likely outcome will be that Sun terminates the
processor development business prior to the Oracle take-over. The end of the Rock processor will not
stall the rise of multicore processors: multicore is now the norm. Rock's 16 cores are now nothing
remarkable. Being able to run two threads per core is no longer remarkable. Would the demise of this
processor, which has been five years in the making so far, be at all remarkable? Well yes.
Rock was going to support hardware transactional memory. To lose this is indeed something to remark on.
Shared memory concurrency is hard. Shared memory parallelism is even harder. The problem is
synchronizing access to storage being used by multiple threads. Programmers generally get it wrong. The
problem is that the tools of locks, semaphores and monitors are all too low-level for the average
application programmer. What is needed is a higher level of abstraction. The same happened with memory
allocation: explicit allocation of memory by programmers led to unmaintainable programs containing many
errors, and lacking in portability, so new abstractions were introduced to make things workable.
Transactional memory is one technique being proposed for ameliorating the problems of synchronization in
shared memory parallel systems. Experiments with software transactional memory have been very encouraging.
However, being in software, they have some performance issues. Hardware transactional memory would really
have been a revolutionary step forward.
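The optimistic idea behind transactional memory (read, compute, then commit only if nothing changed in the meantime, otherwise retry) can be sketched in a few lines of Python. This is a toy covering a single memory cell, not a real software transactional memory and certainly not Rock's design; the names are hypothetical, and the small internal lock merely stands in for the atomic commit that hardware support would provide.

```python
import threading


class VersionedCell:
    """A single shared value with a version number bumped on every commit."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        # Stand-in for an atomic hardware commit; held only for a few instructions.
        self._commit_lock = threading.Lock()

    def read(self):
        """Return a consistent (value, version) snapshot."""
        with self._commit_lock:
            return self._value, self._version

    def try_commit(self, expected_version, new_value):
        """Install new_value only if nobody committed since our snapshot."""
        with self._commit_lock:
            if self._version != expected_version:
                return False  # conflict: another transaction got there first
            self._value = new_value
            self._version += 1
            return True


def atomically(cell, update):
    """Run update(value) as a transaction, retrying on conflict."""
    while True:
        value, version = cell.read()
        if cell.try_commit(version, update(value)):
            return


def run_increments(n_threads=4, per_thread=1000):
    """Several threads increment one cell without any application-level locks."""
    cell = VersionedCell(0)

    def worker():
        for _ in range(per_thread):
            atomically(cell, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.read()[0]


if __name__ == "__main__":
    print(run_increments())
```

The application code in `worker` contains no locks at all; conflicts are detected and retried by the `atomically` helper, which is the higher level of abstraction the post argues for. The retry loop is also where the software overhead mentioned above comes from.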
So if the Rock processor will never be manufactured, will Intel, AMD, IBM, etc. step up and add hardware
transactional memory to their chip lines?
Fri, 19 Jun 2009
by Peter Dzwig
The Carter report on “Digital Britain” calls for “universal connectivity” with
bandwidths of 2Mbit/s nationwide. Is this a realistic target? Is it a worthwhile target? Probably not.
Looking at the growth in Internet traffic and bandwidth over recent years, that target figure seems
absurdly low, let alone sufficient to make the UK “the global Internet business capital”. As
we all know, a promised delivery rate generally falls far short in practice. If the rate were being
promised to your front door, that would be a great deal better, but the report doesn't say that it
is. Surely the £150–200M or so that will be raised by a £6 annual levy on fixed
lines to help fund the service won't actually go very far towards delivering a reliable high-speed net. In
many other nations, target bandwidths that are much higher than the UK's would-be goals are already being
In the UK perhaps more than in many countries, delivery is very uneven. A quick check
on SpeedTest reveals that, in the UK, Ceredigion has the fastest
average download speed at 10.2 Mb/s with upload speeds of 7.7 Mb/s. A glance at the relatively
affluent, high population density commuter areas such as Hampshire and Surrey shows very large areas in
which speeds are well below 1 Mb/s in realistic terms. Many feature as “notspots”. The
reason given by industry for there not being greater connectivity is economic. In SpeedTest's global
lists the UK features 54th in the upload speed league and 41st by download speed.
There is no doubt that the aim of building a high performance UK-wide network is well worth achieving, but
that is a very substantial undertaking. But why go off at half-cock? The only real solution is going to be
the provision of a nationwide fibre network capable of taking the traffic that we are going to need in ten
years' time. Not just today. Such a system has to provide much better speeds in terms of both upload and
download. While people concentrate on download speeds, next generation interactive applications, video
conferencing, video streaming, image download for printing and gaming need speeds to be greater in
both directions. This need will only increase. It would seem logical that, in order to be able to deliver
this we need to put in a national fibre network and to do that it is going to take more than a couple of
hundred million per year. In that case one is forced to ask whether the Carter strategy is right and
whether it is not too modest in its aims, laudable though those may be in the short-term.
Emerging technologies, hardware and software are behind this, particularly in embedded applications.
Applications are being built to take advantage of user demand because these can be realised with increased
processor ability available and coming over the next few years. The growth of processor capacity has
been driven by new technologies. Key among these is multicore...which is why you are reading this here!