The Concertant Blog

  RSS Feed.

Sat, 26 Feb 2011

A new, cheap way to benchmark clusters?

by Peter Dzwig

A perennial problem has been how to benchmark clusters – and, once benchmarked, what the results actually mean. And can it be done cheaply and easily, without any preconceived notions of what the system should be?

Well, we now have the beginnings of an answer, to which people are invited to contribute. Raul Gomez has announced the beginnings of a SourceForge project, called ClusterNumbers, to create such a product, based upon the results of his thesis project. It must be made clear that ClusterNumbers isn't unique in the world, but it is open source and based upon some sound starting points. See, for example, those in The HPC Challenge of Luszczek, Dongarra et al.

The idea is that any user with a modest level of skill ought to be able to benchmark their system and so get a grip on how their distributed application should run on a particular cluster. The tool should therefore address the major issues affecting cluster performance, make them readily accessible via a single user interface, offer the user a basic configuration for carrying out the benchmark, and identify factors impacting performance.

The core of ClusterNumbers is a set of packaged benchmarks for CPU, memory, networking, etc., that can be accessed either individually or as a whole. These are HPL (High Performance Linpack) across the cluster; DGEMM, providing matrix multiplies on a cluster node; FFTE, which tests CPU execution rates by running discrete FFTs; STREAM, to measure CPU/memory performance; IOTRANS, to measure disk performance; and Netperf and PTRANS, to measure network capability under various loads. The observant reader will have noted that many of these are scientifically-oriented benchmarks and that the majority are those used in The HPC Challenge; nonetheless the aims here are somewhat different and they are being supplemented by others.

ClusterNumbers allows the user to select the kind of benchmarks to be run from a PC window that communicates with a daemon running on the cluster's admin node. The daemon then runs the appropriate subset of the benchmarks listed above.
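
To make that division of labour concrete, here is a minimal sketch of the kind of client/daemon split described. It is purely hypothetical: the port number, the message format and the runner script names are invented for illustration and are not the actual ClusterNumbers interface.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.Map;

    // Hypothetical sketch only: NOT the ClusterNumbers protocol, just an
    // illustration of the GUI-client/admin-node-daemon split described above.
    public class BenchmarkDaemonSketch {

        // Map benchmark names to whatever the daemon would actually launch on the
        // cluster; the script names here are invented placeholders.
        private static final Map<String, String> RUNNERS = Map.of(
                "HPL", "run-hpl.sh",
                "STREAM", "run-stream.sh",
                "NETPERF", "run-netperf.sh");

        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(5555)) {   // invented port
                while (true) {
                    try (Socket client = server.accept();
                         BufferedReader in = new BufferedReader(
                                 new InputStreamReader(client.getInputStream()))) {
                        // The GUI sends a comma-separated selection, e.g. "HPL,STREAM".
                        String selection = in.readLine();
                        if (selection == null) continue;
                        for (String name : selection.split(",")) {
                            String runner = RUNNERS.get(name.trim().toUpperCase());
                            if (runner != null) {
                                System.out.println("Would launch: " + runner);
                            }
                        }
                    }
                }
            }
        }
    }

The GUI side would simply open a socket to the daemon and write its selection as a line of text; a real daemon would, in addition, have to schedule the runs across the cluster and gather the results back for presentation.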

Getting the members of the FOSS community involved through an open source project on SourceForge indeed seems a logical step, and according to Gomez there has already been a strong response from the HPC community. And therein lies an issue. While the high performance community without doubt leads many fields and stresses systems in ways that make its contributions to projects such as this invaluable, it is important that those who run clusters in other environments contribute too. Their presence would give ClusterNumbers a broad following and ensure that Gomez's work is not “just another SourceForge project”. After all, some of the most intensive users of clusters in the world are very much not conventional HPC users, although their systems are certainly high-throughput. Perhaps they should include something like DBT2, or an appropriate derivative, as a starter.

The first step is to create a roadmap. The wider the input at that stage, the better for the long-term viability of the technology. I urge those outside the HPC community to contribute, in the interests of giving ClusterNumbers a wider user base than might otherwise be the case.


Sat, 05 Feb 2011

Rushing the Data Around

by Russel Winder

Till recently there was an implicit understanding throughout the various computer-based industries that data meant relational database meant SQL. A lot of data fits very nicely into the relational model, but not all does. A lot of queries on data (held in relational databases) are easily expressed using SQL, but not all are. Trying to force all data and queries into the relational/SQL model leads to a lot of problems in some cases. Hence the NoSQL movement.

The likes of Cassandra, MongoDB, CouchDB, etc. are taking the world by storm since they provide non-relational data storage and querying facilities that work much better than the relational ones for some problems. There is a potential for the pendulum to swing too far, of course, and for problems well handled by relational frameworks to be forced into NoSQL frameworks. This would be a shame.

The future is clearly heading towards a “mixed economy” of relational/SQL and NoSQL. A key decision for storage-based applications – which increasingly means every application there is – will be whether to be relational or not. That is an issue to be solved by the analysts/designers on a case-by-case basis. However there is another issue, and that is the analysis and processing of data. It remains the case, at least until NoSQL systems have much greater impact on analytics generally, that analytics and data mining are seen as applications that employ SQL to access data. Fortunately, there is a third way.

The core of the current problem for most analytics activity is that any application based on SQL queries cannot harness the multicore and cluster hardware architectures that are now the norm. Analytics applications cannot therefore make use of all the parallelism available. Big problems can not only take hours, days, even weeks, but bringing in new hardware with more cores and/or more cluster nodes will make little or no difference to the execution time of the application. Given that the future is one of massively increasing amounts of data, analytics appears to be in deep trouble. What's the solution? Change the paradigm. Enter Pervasive DataRush.

Pervasive DataRush is a software framework that implements a dataflow architecture. A dataflow-structured application is written as a collection of operators connected together by uni-directional channels (things down which data can flow). An operator can have many input channels and many output channels. An input channel can be the output channel of another operator or come from a data source. An output channel either connects to another operator as one of its input channels or goes to a data sink. An operator is an event-driven process, where the event that triggers execution is some state of the input channels.
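
For readers new to the model, here is a minimal, framework-free sketch of the idea. It assumes nothing about the DataRush API itself: the operators are plain threads and the channels are hand-rolled with BlockingQueues.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Generic dataflow sketch, hand-rolled with BlockingQueues as the channels.
    // Illustrative only: it does not use the Pervasive DataRush API.
    public class DataflowSketch {

        public static void main(String[] args) throws InterruptedException {
            final int END = Integer.MIN_VALUE;   // end-of-stream marker
            BlockingQueue<Integer> sourceChannel = new ArrayBlockingQueue<>(16);
            BlockingQueue<Integer> sinkChannel = new ArrayBlockingQueue<>(16);

            // Operator 1: a data source pushing values into its output channel.
            Thread producer = new Thread(() -> {
                try {
                    for (int i = 1; i <= 5; i++) sourceChannel.put(i);
                    sourceChannel.put(END);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            // Operator 2: triggered by data arriving on its input channel,
            // transforms each value and forwards it on its output channel.
            Thread squarer = new Thread(() -> {
                try {
                    for (int v; (v = sourceChannel.take()) != END; ) sinkChannel.put(v * v);
                    sinkChannel.put(END);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            producer.start();
            squarer.start();

            // The data sink: drain the final channel.
            for (int v; (v = sinkChannel.take()) != END; ) System.out.println(v);
            producer.join();
            squarer.join();
        }
    }

The two operators share no state; the only synchronization is the arrival of data on a channel, which is precisely what lets a dataflow runtime spread operators across as many cores as are available.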

This is a very different view of computation from the shared-memory multithreading view, be it object-oriented or procedural. The dataflow model requires operators to be processes, so there is no shared memory between operators. This means operators are highly parallelizable, using all the processing capability the hardware has. The execution-triggering events are all the synchronization a computation needs: no locks, no semaphores, no monitors, no programmer confusion. The dataflow approach may seem strange to programmers brought up on shared-memory multithreading, but once you “get it”, dataflow makes for easier, faster, and less error-prone programming. Moreover the result is an application that can easily and naturally exploit as much parallelism as the problem allows. The more processors you throw at the execution, the quicker it finishes. Up to the theoretical maximum that is possible, of course.

So whilst dataflow is not a “silver bullet”, it is about as close as we are likely to get as far as implementing algorithms is concerned, at least in the short and medium terms.

At the European Data Integration Summit 2011 (EDIS 2011) event held in the London Bridge area on 2011-02-02 – see this website – Pervasive Software announced version 5 of Pervasive DataRush. We are currently assessing this new version, which is a relatively radical relabelling and restructuring of the framework compared to version 4. Initial impressions are that the changes are favourable. Certainly the labels are more consistent with, and less idiosyncratic compared to, the underlying dataflow model. At the time of writing, none of our sample applications run using the new API. However, and I think it bears repetition, the new API appears to be better labelled than the old. At this stage, if I have an adverse criticism, it is that the documentation pack needs some more technical authoring.

Prior to the EDIS 2011 event, we had a briefing session with Jim Falgout (Chief Technologist, Pervasive DataRush) and Ray Newmark (Director of Sales and Marketing, Pervasive DataRush). We had previously talked with Jim early in 2008 and then again at SuperComputing 2008 (SC08) – and wrote articles about Pervasive DataRush at that time, Is Dataflow the New Black? and Pervasive Software’s Datarush – so it was something of a reunion to meet with him last week. Ray has been on board for about 8 months and seems to have provided the direction in terms of marketing that we felt was missing in 2008. Furthermore, Pervasive DataRush now has a definite strategic place in the Pervasive “end to end” integration architecture, which gives it a role in something bigger. Pervasive DataRush is, of course, a separate product, and can be used independently of the rest of Pervasive's offerings, but without a place among those offerings Pervasive DataRush was a little out in the cold.

Overall we think that Pervasive DataRush has a very rosy future, something we perhaps couldn't have said 2 years ago. It is a good product, with a clear role in an overarching architecture, as well as an independent existence.

The Actor Model, originally presented in 1973, has been getting a lot of press recently, particularly as it is the model of concurrency and parallelism in Scala. But it is noticeable that dataflow frameworks are appearing for Scala too. Indeed, within the Groovy milieu there is a framework, GPars (cf. here), that provides not only actors but also dataflow – and indeed CSP (communicating sequential processes). So the FOSS community are beginning to put out dataflow frameworks for JVM-based systems. This strongly validates Pervasive DataRush's dataflow approach as the direction of future computation in the increasingly multicore world.


Fri, 04 Feb 2011

Pervasively Cloudy – but that's good!

by Russel Winder

2011-02-02 was the day of the European Data Integration Summit 2011 (EDIS 2011) event, a one-day, four-track marketing conference held in sight of Tower Bridge, London, organized by and for Pervasive Software Inc., its services and its products – see this website. The day prior, Pervasive organized a number of analyst briefings. In one of them, we talked with John Farr (President and CEO of Pervasive) and Mike Hoskins (CTO of Pervasive) about the company, its products, its history, and its future direction. The following stems from those discussions, with some fill-in from the conference sessions.

15 years ago Pervasive was principally a database provider, operating successfully in the not-DB2, not-Oracle database space – Pervasive was born out of a name change of BTrieve Technologies Inc. (cf. this Wikipedia page for more details on the history). Over the last ten years, Pervasive has moved increasingly, and profitably, via various acquisitions, into the “integration” space – whilst at the same time maintaining its successful database business. Many of Pervasive's competitors have come and gone, usually by being bought by one of the “biggies”. Pervasive has continued as an independent player, continuously reporting profits, indeed growth – generally well above inflation. A good investment.

Pervasive is not, though, resting on its laurels. Far from it. The company invests 25% of profits back into R&D, and that is mostly R rather than D. Pervasive is investing more heavily in R&D than might be thought normal for a company such as this, because it wishes to be at the forefront of innovation.

Innovation at the moment is generally seen as making the Cloud work and be relevant. Pervasive is at the heart of this, which is somewhat essential for a company which is more and more emphasizing its “integration” business. In the Cloud, we have:

  1. PaaS (platform as a service): Pervasive is not really in this business; it is leaving it to the likes of Amazon, Eucalyptus, etc.
  2. SaaS (software as a service): Pervasive is not really in this business either; this is for the likes of Google and Microsoft.

Pervasive is a "data company"; its interest is in providing infrastructure for customers to manage their data. Pervasive's integration products allow people to connect their various sources of data in whatever way they wish. This means software can be moved to where the data is or, as is increasingly common in the more and more Cloud-based approaches, the data can be moved to the program. Pervasive are looking to be the market leaders in DaaS (data as a service).

Pervasive clearly have a strong vision of how to provide innovative Cloud-based frameworks to their customers now and in the future. Which is good, but not really anything to do with parallelism, multicore and cluster computing. So what is the interest for Concertant?

Part of Pervasive's strategy is to use internal “startups” as a cornerstone of its R&D policy. A group gets “spun off”, albeit actually internally, to work on something. The two currently running are Pervasive Data Solutions and Pervasive DataRush. It is Pervasive DataRush that really piques the interest of Concertant. Pervasive DataRush is a dataflow framework, a software architecture that is neither new nor currently seen as mainstream. The process and message-passing basis of dataflow is what makes Pervasive DataRush interesting, what enables it to harness multicore and cluster parallelism, and the reason it will be successful.

In fact Pervasive DataRush has been going for a while – the release of version 5 was a big announcement made at EDIS 2011. Moreover we have interviewed Jim Falgout (Chief Technologist, Pervasive DataRush) previously, early in 2008 via telephone conference, and then in person at SuperComputing 2008 (SC08) – see Is Dataflow the New Black? and Pervasive Software’s Datarush. We will address our perception of the technical progress of Pervasive DataRush in another article. For the moment, it is the strategic importance of this product in Pervasive's portfolio that is interesting.

Pervasive has got the multicore and clustering bull by the horns. In version 11 of its database offering, Pervasive SQL, there is support for multicore processors as well as 64-bit processing and IPv6. Now it has positioned Pervasive DataRush as its data analysis, analytics, and data mining offering. Not a part of the data movement around the Cloud, but core to its integrated offering. This integrated offering is marketed under the label Pervasive DataCloud, which is neither PaaS nor SaaS; it is a data-oriented framework that sits over PaaS, employing SaaS.

Pervasive is tiny compared to the IBMs, the Oracles, the HPs, the EDSs of this world, but it looks as though that is exactly why it is remaining a very successful company. It is providing highly integrated, low-ceremony, low-overhead solutions – something required by SMEs, which cannot be provided by the “biggies”.

Whilst Pervasive remains profitable and innovative, I truly hope it remains independent and does not become the target of acquisition.


Mon, 25 Jan 2010

Google, China, India, France and Digital Britain

by Peter Dzwig

About 25 years ago I wrote a report for a client who was thinking of opening up shop in Russia. Part of the core of that report was that not every society has the same (market) traditions and culture as those in the West. That may or may not seem obvious. The client wanted to enter the Russian (at that time recently-Soviet) market with an American-style market proposition. My view was – and still remains – that the market wouldn't necessarily adopt the western model rapidly, if at all. We are all aware that the Russian "model of capitalism" is very different from the western one, let alone the US version. Still more so for the current Chinese version of capitalism "with a Socialist face": we have no idea when, if ever, it will become like the western version. In fact all the evidence is that the two are actually diverging.

So it perhaps should not surprise us that the Chinese model of the Internet is also so different from ours. China is a 3,000-year-old civilisation with a long tradition of insularity, of which the current China is merely the present manifestation. For a long part of that time it has looked down on the outside world and in effect closed its doors against it. Therefore I was fascinated by an article in the FT of 20th January about the so-called "Chinese Firewall". It didn't come as a great surprise that Google and the Chinese authorities should have had a run-in; one was surely inevitable if not necessarily imminent. There is a fundamental tension between the Chinese way of doing things and the western way of doing things.

Here I am specific in my use of the term "western", as countries such as Japan and Korea, as well as a number of South-East Asian countries, sit along a spectrum between the western model of the Internet and the Chinese one. The fact that China has a much bigger population than any other country and is "opening up" is seen by many in the West as a huge potential market for their companies. To the Chinese it probably appears as a wholly different proposition. It is not clear that the perspective the Chinese adopt is anything like the rest of the world's. After all, their population is about three times that of the EU and an even larger multiple of the US's; they have a huge market for their own technology and do not have to be beholden to the outside world. So perhaps we are wrong to be surprised at western perceptions of China's attitude towards the likes of Google (and they aren't the only ones). I am not referring to alleged attempts to hack dissidents' Googlemail accounts, but to the overall marked divergence in attitudes between China and Google going back to the point at which Google entered China.

More immediately salient is that, as the FT article shows, the Chinese usage profile of the Internet is different from that in the West. Certainly there are areas that are very different because of the sensitivities of the Chinese government to social networking sites; but what struck me were the figures from McKinsey quoted in the FT article about the general profile. The next part is broadly a summary of those, for which I take no credit. A Chinese person is likely to make (all figures I am going to give are rough) two-thirds as much use again of the net for email and searching for information as a European counterpart; a Chinese person makes only one-eighth (!) of the use of the net for work-related purposes as a European counterpart, but 60% more use for gaming; chatting/instant messaging usage is 235% of European usage; and downloading of films or music is almost 80% greater.

A proportion of the divergence in these figures might be laid at the door of the lack of social networking sites (email traffic) and Chinese attitudes to intellectual property, in particular copyright (downloading), but not exclusively. The Chinese usage model for the net appears to be one of a gigantic playground. For most net-using westerners with a tradition of research, particularly in Europe, the net is much more of a space within which to find out information as well as to communicate with friends. That is not to say that westerners don't game, download games or chat; but in China the figures are much greater.

I don't actually want to comment much further on the Chinese model of the Internet except to say that it offers an alternative profile of usage to the typical one that we have adopted in the west. The Internet has become what it has become in the rest of the world because of the model that the rest of the world has adopted, driven by a US-centred model. Is that the only model? Should we not at least consider the alternative ways of using the Internet, and what they might imply? If we look at other usage models then perhaps we could learn for the future and indeed plan our own networks better. I would be fascinated to see what the comparable figures look like for other emerging economies – India and Brazil come to mind here – how they evolve over time, what the regional evolution is like, and how usage has developed in the past.

In France in particular, there is a debate going on at present as to how to deliver much greater bandwidth than they have now, including to rural populations – and this is in a country where substantially higher rates than the UK's are the norm. The Digital Britain plan to deliver 2 Mb/s (max) to the door would be woefully inadequate if we were to look at a model in which there were a lot more gaming, chatting and, above all, downloading of movies/music and on-line TV. There are two sides to this problem. One is typified by the usage model above. The other is – and this is why this appears here – that a network's characteristics in terms of bandwidth needs are set by the technologies that are coupled to it. Processor speeds are growing and will go on growing. Multicore means that that is a practical reality; that, after all, was its rationale. That will increase demand on the ability to download and upload – and not just for the user, but for industry as well. Thus network speeds are a factor in economic performance; lack of delivery will ultimately be a barrier to economic competitiveness.

The Digital Britain plan is woefully inadequate, both in respect of technology (bandwidth) and in terms of delivery targets. It is also not going to address the UK's need to be able to compete. Even were we to adopt a less business-oriented and more "Chinese" model of use, where more than raw speed is the issue – quality of line, latency and so on matter more – we would fall short, because most of the network in the UK is inadequate to deliver to most people. If the target does not change it will do little to reduce "notspots" in relation to average speeds. In fact the concern is that they may even expand. To put some figures to this for a moment: if you look at SpeedTest (http://www.speedtest.net/SpeedTest), for example, the UK (by their sampled speeds) ranks 41st globally by download speed and 64th by upload speed. For the "global leader in technologies" that the government aspires to be, these are not good figures. Yesterday (22nd January) BT announced that it will deliver 40 Mb/s with a service rolling out to a limited number of subscribers this year and "reaching" 4 million by next year. The fact that up-to-date technology can already deliver well over 50 Mb/s in real usage perhaps says more about the ailing state of the UK infrastructure than anything else.


Tue, 15 Dec 2009

Was Intel right to kill Larrabee?

by Peter Dzwig

Predictably the fuss about the demise of Larrabee lingers on.

Let's start at the beginning. Intel has said that the processor will have a continued life as an SDK of some form, and that it will be available to various parties who have expressed a desire to be able to use it. Importantly, Intel has offered to give the HPC community access to it.

The reality has to be that Intel could no longer see the market opportunity for the technology. Surely at one time they could: whatever other, possibly apocryphal, stories are told about its origins, it needed that opportunity for the management to give it some hope of seeing the commercial light of day.

Had the design run out of engineering steam? Possibly. Larrabee was expected to have seen the light of day in the second half of next year. By then the other players in the market will have pushed further ahead – the more so since all the hype will have focused their minds on doing so. By which time perhaps Larrabee would have looked architecturally interesting but behind the curve. So it is far from impossible that the high-ups on the engineering side just decided that they weren't going to be able to squeeze enough out of the technology. From an engineering perspective they will have learnt many lessons. In fact, we can be almost certain of it. There are even a few hints of Larrabee in the SCC; not many, but some. Larrabee should really be seen as a test bed for ideas: about graphics processors, about memory disposition, about interconnects, programmability and much else besides. That's its real long-term value for Intel.

The vacillations over the last couple of years or so, during which we saw specs change and configurations develop and then disappear, have contributed to the market's decreased desire for Larrabee – though most of this downward pressure has come from people who didn't really know the product. Its slot has meantime largely been filled by NVIDIA et al. and the goalposts have really moved. For Intel this meant that had they ever got to market they would have come in a, perhaps distant, third. That doesn't make business sense for them.

Looked at from a purely commercial perspective then, the decision to remove Larrabee from the likely product list seems entirely reasonable and perhaps inevitable.

In conclusion this was the right decision to take. The technology has gone as far as it can or at least as far as Intel wanted to take it. There has been substantial value added to Intel's business by the teams that contributed the engineering skills and intellectual property, and finally it will be made available to internal and external developers and the HPC community as a development platform.


Fri, 11 Dec 2009

Grails in the Cloud

by Russel Winder

Groovy is gaining traction in the Java community ever more rapidly. The JVM is becoming the standard hardware-independent platform for almost all new applications – especially those that are Web-oriented. Polyglot programming is rapidly becoming the norm: systems are developed in some mix of Java, Scala, Groovy, Clojure, Jython, and JRuby. Until recently Jython and JRuby were being directly supported by Sun. However they have been ejected from Sun's corona as part of the purchase of Sun by Oracle.

Groovy has, since its inception in 2003, been developed by the open source community as a Codehaus project. Inspired by Ruby on Rails as a web application development platform, the Grails project was born, again driven by the open source community. But there is commercial development interest too. The company G2One, which was formed by the Groovy and Grails project leads, was bought some time back by SpringSource (who own Spring). SpringSource then put quite significant resources into Groovy and Grails development and most especially into Eclipse support for Groovy. SpringSource's interest was motivated by the fact that Grails was beating "Ruby on Rails" in the commercial arena; that it uses Spring (and Hibernate) under the hood; and that Grails is the easiest way of developing Spring-based applications that there is.

SpringSource has in its turn recently been bought by VMWare. So whilst Groovy and Grails are still owned by the community, VMWare is now putting resources into development via SpringSource, but guided by VMWare's commercial strategies. This means virtual machines and clouds.

Graeme Rocher (Grails Project Lead) yesterday gave a presentation at Groovy & Grails eXchange 2009 in which he outlined what is coming in Grails v1.2, to be released within two weeks. Using virtual machines for deployment and getting into The Cloud were clear messages. Currently The Cloud more or less means Amazon (which may not be acceptable for many businesses), but there was also the “private cloud” idea: businesses having internal clouds and using virtual machine technology to make application deployment easier and isolated from the outside world. VMWare's hand in this message is rather clear, even though the presentation was branded SpringSource!

Grails, and on the back of it Groovy, is now being made ready for prime time: Grails version 1.2 and Groovy version 1.7.0 are being rolled out before the end of the year providing the base for next year's new crop of Web applications.


Thu, 10 Dec 2009

Intel's 48-core SCC processor, Terascale, Larrabee and processor futures

by Peter Dzwig

So Intel have canned Larrabee and gone to 48-core clouds on chips. Is it really that simple?

The short answer is “yes” and “no”.

Intel have been looking at a variety of architectures over a period of time. This is an obvious step: if they are going to push their core counts beyond the (relatively) few cores that they have on production chips at present, then they need to understand what the issues are going to be and what design strategies are useful. The Terascale chip (aka Polaris) was a very different beast from the recently announced 48-core processor, being based on 80 VLIW cores, so looking nothing like an x86 configuration, and described by some who knew it as barely programmable. Nonetheless there were apparently a lot of lessons. Intel describe the former as “primarily a circuit experiment” and the SCC as “a circuit and software research vehicle”.

The current processor is (apparently) readily programmable, being “IA-compatible”, and so can run off-the-shelf apps. It uses message passing, shared virtual memory and actors. It is also made on 45 nm technology. However, if the announcement is to be taken at face value then this too will not make it to production.

It's not that surprising that Intel appears to be slowing down work on Larrabee, given its on/off history and changes of configuration, rumoured or otherwise. It is very unlikely, though, that Intel would want to lose the experience and technological benefit gained from developing it. That's not how engineering progresses. My guess is that some of the developments will re-appear in some shape or form in future products.

These are all steps along the road.

If I were asked what will make it to production, my guess is that it will not look a great deal like any of these. A heterogeneous hybrid with several different types of cores, some targeted at specific problems, might be closer. Whatever does appear will contain lessons that have come from all of these processors, and from all Intel's other multicore processors – and Pentiums too. Probably the 48-core system is somewhat closer to what is likely to be reality than Terascale ever really was. Given the predicted growth in the number of cores on a chip, for reasons of engineering and programmability it would appear that distributed memory and some kind of network is the way to go. IA will most probably be implemented in some shape or form, if only for backwards compatibility.

It is also interesting to speculate how Intel will address its future markets. Traditionally the embedded market has seen different architectures. However the commercial challenge of widely differing novel architectures will be great. Whether or not this leads to some design rationalisation is still to be seen. Would multiple, possibly divergent, processor lines make commercial sense in the nearer term?


Mon, 09 Nov 2009

ARM and FPGAs – and a lot else

by Peter Dzwig

ARM are always a company worth noting, if only because they dominate a market sector (processors for mobile devices) even more completely than Intel dominate the PC market. According to figures currently being bandied about, ARM hold in excess of 95% of the current mobile market. According to some that goes as high as 98%. What is perhaps an even more important measure of that dominance is that most handsets have 2–3 ARM processors. That is real market dominance.

Companies such as NVIDIA, Qualcomm and others are using ARM's processors to move the market for netbooks and notebooks ahead. This is an area in which Intel sees itself and its Atom as having a natural dominance. That is clearly not the way that ARM and its collaborators anticipate that things will turn out.

A few days ago, ARM had its annual technology meeting in “The Valley” around which clustered a number of announcements. Perhaps the most interesting for us was the link between ARM and FPGA manufacturer Xilinx. The collaboration owes a lot to the finalisation, or near finalisation, of the AMBA bus specification. Xilinx can now see – and are keen to tell the world – how the combination of FPGAs, ARM's Cortex and AMBA fit together and how AMBA may become a solution for on-board FPGA communications. AMBA is not an ARM.

While AMBA is almost thirteen years old, it has reached a level of maturity where it is now seen as a product capable of delivering pretty much everything that an embedded designer is looking for. In that respect at least it is regarded by many as the de facto 32-bit embedded standard.


Wed, 28 Oct 2009

Tilera Gx – “100 is an odd number”

by Peter Dzwig

Tilera first made a name for itself a couple of years ago with the Tile64, which we wrote about at the time. Now Tilera have announced the Gx series of "tile"-based processors. A development of the earlier Tile64 and TilePro chips, the Tile Gx can have from 16 to 100 tile processors on the chip. These form a homogeneous array of 64-bit VLIW processors with a 64-bit instruction bundle, interconnected by a mesh network. The pipeline is three-deep and can handle up to three instructions per cycle. The whole is programmable using C or C++ via the GCC compiler, and can run Linux. Tilera have an Eclipse-based IDE.

Claiming to be “the world's first 100-core processor” and to “offer the highest performance of any microprocessor yet announced by a factor of four”, the PR is a little over-hyped. However the Tile Gx is likely to be an important chip in its target sectors. These are essentially the embedded markets, covering the gamut of high-performance applications such as advanced networking, wireless infrastructure and digital video. These don't surprise any Tilera-watcher. However the addition of Cloud computing as an applications area shows that they are starting to move away from their traditional markets to look more broadly. All that the PR says is that suitable applications may lie in areas such as LAMP servers, data caching and databases. Whether this means that anyone is already running a corporate database on a Tile system is not made explicit. There are, though, database and data-processing applications which are well suited to multiple data pipelines.

The Gx does appear to offer some real performance leaps and some very interesting architectural novelties. The 100-core angle is really just what it says. We would be hard pressed to think of another processor with exactly 100 cores: 80, yes (Intel Terascale); 90, yes (Cisco); over 100, yes (many of them, some of which even saw the light of commercial day); but exactly 100? We can't think of one for which you could produce a product spec-sheet! The Tile Gx series comprises the Gx16 (16 cores, 4x4), Gx36 (36 cores, 6x6), Gx64 (64 cores, 8x8), and the Gx100, which unsurprisingly has 100 cores in a 10x10 grid. The performance claims will need some practical justification, but that will have to wait till silicon is available.

Although the chips arrange the tile processors in a regular 2-dimensional array, problems do not always fit such a structure. The Gx series has routing capabilities to get round this: the programmer can build appropriate networks of processors out of the cores and the interconnect, and do so without compromising performance. If you look at most applications where you are doing proper parallel processing, i.e. mapping directly between cores and algorithm components, then you end up with irregular networks. This is why Tilera's local memory structure (32K L1i, 32K L1d, 256K L2 per tile) is appropriate in a very general-purpose architecture.

Sadly you can't expect to see the Gx in your friendly local distributor's catalogue soon. The Gx36 is slated for introduction around Q4 2010 – which the experienced among you may interpret as you see fit. We will though be writing more about the Gx series soon.


Thu, 08 Oct 2009

Groovy Parallelism

by Russel Winder

These days when the average programmer thinks of handling concurrency and parallelism, they usually think "threads". This then leads to horrible synchronization issues and worrying about locks, monitors and semaphores. And in the end the programs generally have non-deterministic errors. At the heart of the problem is shared memory.

Many, many years ago, models of concurrency were proposed – cf. the Actor Model and CSP (Communicating Sequential Processes) – and these are seeing a huge resurgence of interest with the pandemic parallelism now available on all computers due to the Multicore Revolution. Erlang's parallelism is based on the Actor Model. Scala uses the Actor Model as its mechanism for dealing with concurrency and its subset, parallelism. Clojure also makes use of the Actor Model. CSP is springing up with JCSP, Python CSP, etc. Java though is still stumbling along trying to harness parallelism with threads. Until now.
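
To make the contrast concrete, here is a minimal actor-flavoured sketch in plain Java. It is illustrative only – a hand-rolled mailbox rather than GPars, Scala or Erlang – but it shows the essential point: the actor's state is private, and the delivery of messages is the only synchronization.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // A hand-rolled actor-style sketch: the actor's state is private and the only
    // way to affect it is to post a message to its mailbox. Illustrative only.
    public class CounterActor implements Runnable {

        private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
        private int count = 0;   // never shared, so no locks are needed

        public void send(String message) { mailbox.offer(message); }

        @Override
        public void run() {
            try {
                while (true) {
                    String message = mailbox.take();   // execution is triggered by a message
                    if ("stop".equals(message)) break;
                    if ("increment".equals(message)) count++;
                }
                System.out.println("Final count: " + count);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        public static void main(String[] args) throws InterruptedException {
            CounterActor actor = new CounterActor();
            Thread actorThread = new Thread(actor);
            actorThread.start();
            for (int i = 0; i < 1000; i++) actor.send("increment");
            actor.send("stop");
            actorThread.join();
        }
    }

No locks appear anywhere, and concurrent senders cannot corrupt the count, because only the actor's own thread ever touches it.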

The Groovy community has been discussing what to do about harnessing parallelism for a year or two now. Last year Václav Pech acted and started the GParallelizer project. This was inspired by the work in Scala on the Actor Model and focused on using Groovy as a base on which to write a domain-specific language (DSL) to act as a coordination language managing parallelism. Till a couple of months ago this had been a one-developer project. Now though it has become a serious, and probably a strategic, multi-developer project.

Rebranded GPars, and now a Codehaus project, it is an effort led by Václav Pech – and one which includes your current author – that will undoubtedly see Groovy used as a way of specifying the concurrency and parallelism architecture for many a Java system. Parallelism on the JVM just got very Groovy.


Wed, 07 Oct 2009

Are Linux and Unix being ignored by tools vendors?

by Russel Winder

Over the last few days, there have been various announcements by Intel, AMD, NVIDIA and all the usual suspects, of new or revamped tools to enable programmers to harness the multicore CPUs and GPUs that are now effectively mainstream hardware. There is clearly a yawning chasm between today's hardware systems and the use made of these by today's software systems. A gap that is likely to get bigger before it gets smaller. Hence the push by the hardware manufacturers to ensure there are good tools available. It is purely enlightened self-interest.

Why comment? It seems that unless you are using Visual Studio you do not get access to these tools. Now whilst Visual Studio is a very important “player in the game”, an increasing number of developers use Linux, Solaris, Mac OS X, FreeBSD, etc. as their development platform. Moreover, Linux is the majority player in the “operating system for HPC” stakes. It seems a poor strategy therefore to treat these platforms as third-class citizens or, as in many cases, simply to ignore them – particularly bearing in mind that today's HPC application is tomorrow's mainstream application.

Then of course there are articles talking about how these tool manufacturers are “in talks with” you-know-who about operating system support and tool support, and your mind is drawn to “conspiracy theories” . . . is this just another aspect of the attack on alternative operating systems by a monopolist?


Wed, 16 Sep 2009

Tools for the future?

by Peter Dzwig

At present there are some very good tools out there for supporting parallel program development, from language extensions and compilers to software architecture and design tools. Yet there are none which actually deliver the holy grail of parallel programming: to take an arbitrary piece of sequential code and transmute it, in the modern equivalent of the Alchemists' Dream, into code capable of running on any platform and delivering anywhere near ideal performance.

In fact this particular dream is highly unlikely to happen because, in general, parallel code, and in particular parallel algorithms and hardware, are substantially different in form from their sequential counterparts.

Parallel programming, in many diverse forms, has been around as a commercial reality since at least the 1970s when ICL (now part of Fujitsu) launched the DAP (Distributed Array Processor) as an attached processor for its mainframes; you can push that date back further if you include academic exercises and multiple CPU systems. Yet to date the dream goal hasn't been reached. Technologies as diverse as the DAP, SuperNode and its relatives from Meiko and Parsytec, from supercomputers to modern multicores, sought to solve the problem, or at least address it, through the deployment of specialised compilers, extensions to existing languages, or complete new languages. This worked adequately at the time because the user base for each system was limited in one way or another, and many of the would-be users were in research facilities. This meant that they had the time to work out the problems, and modify their code appropriately.

The adoption of multicores by the preponderance of manufacturers as the way to deliver cost-effective performance (by a wide variety of metrics) has meant that market penetration has increased for processors having two, four or eight cores. The demographic of the user base has broadened dramatically as a result. This means that users are no longer prepared to deal with arcana in order to get the promised performance; they want it delivered simply. They don't want to see any change in the way that they program, and retraining should be minimal if any is needed at all. Up to now the user (except for the specialist) has been shielded from the details of a processor by the operating system and other layers.

As the number of cores on a chip grows – and it will – the problem of how to realise the performance on offer will become increasingly complex. The industry cannot expect the user or programmer to learn specific languages or extensions to languages in order to be able to program company X's laptops. It will get more complex still because it will be possible to create highly customised specialist installations. While potentially important where there are particular requirements for high performance, these will reduce the potential for code portability.

Then there are all those different architectures...

What the user will want is to program/develop their program/application once and once only, thereby preserving software investment. Whereas nowadays a modern application can still run on a 1.2 GHz Pentium (albeit slowly), such backward code deployment will become more complex and eventually downright impossible.

How are we to address this? The simple answer is that we don't know at present. Yes, we could point to a few technologies around today; but perhaps it is better to ask what the user is likely to want. If portability (i.e. maintaining the value of software investments) is to be the principal criterion, then surely we need to hide hardware changes from the user. If we assume the existence of some sort of operating system level, then we are presumably interposing an additional layer between the user and the operating system. One would anticipate that this would detract from raw performance which, for those who demand raw performance, would be detrimental.

However, while this might be important for certain user communities, we must accept that the vast majority of users, and indeed of developers too, don't care – provided that they don't lose “a lot” of performance. This is particularly important as performance improves. What we should be looking at is the proportion that is lost. Provided that this can be limited to a low proportion of the overall figure, the vast majority probably won't care. Indeed there is some suggestion that such overheads may reduce over time, if history is a guide.

What tools we might run over a large core-count system, we don't yet know. It may well be that the tools that we will need don't yet exist. It would be an extremely worthwhile program of research for people to step back and take a long hard look at what we really need. The present assumption, from almost all sectors of our industry, is that they will be like what we have already. What justifies that assumption? If you look at that question in some depth – and that is a part of Concertant's activities – then the evidence that we know how to deal with even 64-core systems (due around 2015) is fairly scarce. There is certainly a paucity of consensus. It takes a good few years to get from the research lab to the market, so work had better get underway soon.

The tools industry is quite probably set to change, conceivably beyond all recognition.


Tue, 15 Sep 2009

Report on a Workshop on Multicore Processors held at IET Savoy Place

On June 29th Concertant organized a workshop on behalf of the UK's Grid Computing Now! Knowledge Transfer Network (GCN!-KTN) to investigate the consequences for the UK of the multicore revolution.

Around 40 invitees from end-users, industry (both software and hardware), government, other KTNs and academia attended the workshop. The report contains a set of recommendations to improve the UK's competitive position in the global MCP market.

The final report is here.


Fri, 11 Sep 2009

Turing – better late than never?

After a campaign including a petition on the Prime Minister's website, Gordon Brown has finally apologised for the “shabby” way in which Alan Turing was persecuted in the 1950s for his homosexuality, in a series of events which led to his committing suicide and which lost the UK its most influential computer scientist. The apology had been the goal of a petition run over the past weeks.

The text of the statement from the Number 10 website is as follows:

2009 has been a year of deep reflection – a chance for Britain, as a nation, to commemorate the profound debts we owe to those who came before. A unique combination of anniversaries and events have stirred in us that sense of pride and gratitude which characterise the British experience. Earlier this year I stood with Presidents Sarkozy and Obama to honour the service and the sacrifice of the heroes who stormed the beaches of Normandy 65 years ago. And just last week, we marked the 70 years which have passed since the British government declared its willingness to take up arms against Fascism and declared the outbreak of World War Two. So I am both pleased and proud that, thanks to a coalition of computer scientists, historians and LGBT activists, we have this year a chance to mark and celebrate another contribution to Britain’s fight against the darkness of dictatorship; that of code-breaker Alan Turing.

Turing was a quite brilliant mathematician, most famous for his work on breaking the German Enigma codes. It is no exaggeration to say that, without his outstanding contribution, the history of World War Two could well have been very different. He truly was one of those individuals we can point to whose unique contribution helped to turn the tide of war. The debt of gratitude he is owed makes it all the more horrifying, therefore, that he was treated so inhumanely. In 1952, he was convicted of “gross indecency” – in effect, tried for being gay. His sentence – and he was faced with the miserable choice of this or prison – was chemical castration by a series of injections of female hormones. He took his own life just two years later.

Thousands of people have come together to demand justice for Alan Turing and recognition of the appalling way he was treated. While Turing was dealt with under the law of the time and we can’t put the clock back, his treatment was of course utterly unfair and I am pleased to have the chance to say how deeply sorry I and we all are for what happened to him. Alan and the many thousands of other gay men who were convicted as he was convicted under homophobic laws were treated terribly. Over the years millions more lived in fear of conviction.

I am proud that those days are gone and that in the last 12 years this government has done so much to make life fairer and more equal for our LGBT community. This recognition of Alan’s status as one of Britain’s most famous victims of homophobia is another step towards equality and long overdue.

But even more than that, Alan deserves recognition for his contribution to humankind. For those of us born after 1945, into a Europe which is united, democratic and at peace, it is hard to imagine that our continent was once the theatre of mankind’s darkest hour. It is difficult to believe that in living memory, people could become so consumed by hate – by anti-Semitism, by homophobia, by xenophobia and other murderous prejudices – that the gas chambers and crematoria became a piece of the European landscape as surely as the galleries and universities and concert halls which had marked out the European civilisation for hundreds of years. It is thanks to men and women who were totally committed to fighting fascism, people like Alan Turing, that the horrors of the Holocaust and of total war are part of Europe’s history and not Europe’s present.

So on behalf of the British government, and all those who live freely thanks to Alan’s work I am very proud to say: we’re sorry, you deserved so much better.

Gordon Brown


Wed, 09 Sep 2009

Hot Chips

by Peter Dzwig

The end of August saw the Hot Chips meeting at Stanford. There were announcements that we think deserve comment, from IBM, Sun and AMD – which in turn have opened many questions.

The diversity of processor architecture among the big players continues and there is little sign of consensus about which way the market will evolve.

IBM were talking about the Power 7, which will apparently be available in 4-, 6- and 8-core variants, have 32MB of Level 3 cache and support 4 threads per core. Rumour has it that it will be among the fastest processors available, though whether it will surpass Fujitsu's SPARC implementation remains to be seen. Sun's Rainbow Falls (SPARC T3) processors have 16 cores, each with its own L2 cache, and, being a Sun product, of course support threading – in this case up to 128 threads. AMD's Magny-Cours offering will have 12 cores; in reality it's two 6-core Istanbul processors in the same package. Intel weren't very conspicuous, giving more details of their 8-core, 24MB-cache, 16-thread Nehalem EX, although they did talk a little about their 32nm Westmere chip.

The group viewed as a whole shows that designers are now putting a considerable amount of effort into novel communications architectures and communications speeds, and also into matching cache structures, in order to achieve the potential throughput of these chips. Many of these processors have faster core interconnects and faster I/O, enabling performance and data movement to be better balanced. For this generation, and the ones beyond it, the ability to move data between cores is going to be crucial.

With the exception of Westmere, this collection is slated for release in the course of next year. So 2010/11 is expected to see the evolution into double-digit core counts of many top-end servers, which is where this group is mainly targeted.

The real question – whether or not software will be able to use this power – is the key to the commercial success of these processors and others that follow in their wake. In the main, many commercial server-based systems use one (fast) core, or perhaps a pair of cores to deal with replicas of the same process. However only a fraction of the potential performance is being reached in this way. Proper parallelism is a way off yet in mainstream applications. For the hardware industry, however, its throughput profile makes the high-end server market the obvious point at which to introduce these high core-count architectures.

Today in most people's terms quad-core is regarded as mainstream and fairly high performance, so what is being proposed here is a big leap forward, even for high-end servers. The faster internal structure means that the architectures are, as a whole, becoming more balanced and so opening up to faster data streams both in I/O terms and among cores. However, the software industry and the peripherals industry haven't caught up yet. These new processors are becoming really data-hungry and there aren't yet that many applications around to take advantage of them.

Obvious industrial applications lie in the broadcast industry and in other media applications, including of course the Internet. The question is how long it will be before the mainstream user catches up, and how they will use the processors then.


Thu, 03 Sep 2009

Cray gets (some) PathScale assets and helps set up a new PathScale

by Peter Dzwig

As the summer vacations ended, out came news that probably cheered the hearts of many an HPC programmer – and possibly a few investors too – that Cray had succeeded in acquiring several of SiCortex's assets, including the PathScale EKO compiler suite. PathScale's suite provides 64-bit C support as well as C++ and Fortran compilers for Linux-based environments.

In a move which may have surprised some, Cray will use some of those assets internally, but will also partner with the open source world through a combination of existing PathScale engineers and NetSyncro.com, who will continue to develop the compiler, provide support for users, re-brand their efforts as “PathScale”, and be supported by a “new PathScale” company. NetSyncro.com is an open-source-oriented group of long-term UNIX and Linux developers with a wide range of experience. This new structure will enable Cray to use some of the PathScale assets to develop its own IP, while permitting existing users of PathScale on Cray to keep using their favourite tools. In addition, while important for many standard “Cray-style” HPC applications, PathScale's suite is also used on platforms other than Cray. The licensing for these is, according to the community site, being sorted; so it looks like no-one will miss out.

Cray's future toolchain will continue to provide a diverse range of compilers for its boxes, with PathScale sitting alongside the Portland Group's PGI server C/C++/Fortran compilers and tools for Linux, and Cray's own CCE.

Interesting side comments have included suggestions that the new PathScale will direct research towards multicore architectures.


Sat, 22 Aug 2009

Intel, Acquisition, and the Direction for Tools

by Peter Dzwig

Last year at – and for that matter after – SuperComputing, we wrote about the lack of direction in the parallel software market. In subsequent commentaries, we talked about how, unless there was agreed action across the marketplace, software could easily become dominated by one company. It looks like Intel are grasping the bull by the horns.

In a recent bout of acquisition, Intel has since June brought into its fold (though doubtless the negotiations have been going on for much longer) WindRiver, Cilk Arts and now RapidMind.

It's an interesting mixture: the dominant embedded software provider – at least as far as UNIX/Linux systems are concerned – and two smaller, but very innovative, software companies that appear to complement Intel's existing offerings rather well.

Cilk Arts was an MIT offshoot, targeting dynamic and highly asynchronous code. Its C-based (Cilk) and later C++-based (Cilk++) offerings align well with Intel's direction; in fact James Reinders said in a blog piece that Intel's Threading Building Blocks (TBB) were inspired by the work on Cilk. So Cilk Arts' Cilk++ offering clearly fits at several levels.
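
For readers who have not met the Cilk style of parallelism: the essential idiom is to split a computation recursively into tasks that a work-stealing scheduler spreads across cores – the same idea picked up by TBB and by the JVM's fork/join framework, which grew out of Doug Lea's JSR-166y work. Here is a minimal, illustrative sketch in Java using the standard ForkJoinPool; it is not Cilk++ or TBB code, just the shape of the model.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // The Cilk-style fork/join idiom, expressed with the JDK's work-stealing
    // fork/join framework rather than Cilk++ or TBB themselves.
    public class SumTask extends RecursiveTask<Long> {

        private static final int THRESHOLD = 10_000;
        private final long[] data;
        private final int from, to;

        public SumTask(long[] data, int from, int to) {
            this.data = data; this.from = from; this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {           // small enough: just compute
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                            // "spawn" the left half
            long rightSum = right.compute();        // keep working on the right half
            return left.join() + rightSum;          // "sync" and combine
        }

        public static void main(String[] args) {
            long[] data = new long[1_000_000];
            for (int i = 0; i < data.length; i++) data[i] = i;
            long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
            System.out.println(total);
        }
    }

The fork()/join() pair plays the role of Cilk's spawn and sync; idle worker threads steal the forked sub-tasks, which is the scheduling idea TBB also adopted.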

RapidMind's technologies are widely admired throughout the industry and it is hardly surprising that Intel will continue to market RapidMind's offering (though whether under a new label or not, and for how long, remains to be seen). Its tools could fit well with Intel, depending on which way Intel is moving. But we must presume that Intel wouldn't have acquired them were they not broadly heading down the same path.

WindRiver, at a price of some $884 million, is by far the largest of the three, but offers an interesting synergy. It has been suggested that the embedded Linux expertise will sit well with the Larrabee architecture, as that chip has to run its own operating system in order to handle the complexities of its architecture and to integrate with the more mainstream device architectures currently featured in most boxes.

WindRiver also brings considerable mobile expertise. The whole area of mobile devices is one that Intel has targeted for a number of years and which is growing in importance both for the company and for the market as a whole. The continuing growth in markets for all categories of mobile device, as well as the sheer number of new types of devices that people talk about, means that the area will likely continue to support substantial growth over the next few years. Intel is already a major player there and its strategies would clearly benefit from the acquisition.

So where does this leave us with respect to parallel systems? Well, it clearly enhances Intel's parallel tools and probably presages a broader development of C++-related technologies to complement its TBB, OpenMP and related offerings; and it provides, at the very least, the ability to expand its tool chain.

But what of the opposition? They haven't as yet replied in similar fashion. As we have said before, we could get once again to a position in which one large or influential company sets the course of software for years to come.

Part of the problem is the sheer diversity of parallel systems. They extend from asynchronous, highly asymmetric embedded systems to highly symmetric ones, with processor complexities running from relatively simple architectures to the extremely complex. The program development issue is then further complicated by the sheer variety of algorithms involved.

Does this mean that there is no single solution to the issue? Possibly not, but we have yet to find a single language or development environment that can embrace all types of application, and that may never change. For the moment, though, Intel is forcing the pace.


Fri, 14 Aug 2009

Multicore, Threads, Message Passing, and Operating Systems

Russel Winder gave a presentation at the UKUUG Summer 2009 conference entitled Shared-memory Multithreading is the Wrong Way to do Parallelism; the slides can be found here. As well as emphasizing the move towards lightweight processes and message passing, cf. Erlang, Scala, etc., the session raised the question of whether current operating systems would be up to the task of managing systems with multiple processors, each with thousands of cores, all using distributed memory – single central memory architectures are untenable in the presence of very large numbers of processors/cores.
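
The talk's exemplars, Erlang and Scala actors, make the point more naturally, but the style can be sketched even in plain C++: the illustration below (mine, not from the slides) builds a minimal channel from a queue, a mutex and a condition variable, so that the two threads communicate only by passing values and never touch shared mutable state directly.

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <utility>

    // A minimal single-producer/single-consumer channel. All the sharing is
    // hidden inside the channel, so the communicating code itself deals only
    // in messages, in the spirit of Erlang mailboxes or Scala actors.
    template<typename T>
    class Channel {
        std::queue<T> q;
        std::mutex m;
        std::condition_variable cv;
    public:
        void send(T value) {
            { std::lock_guard<std::mutex> lock(m); q.push(std::move(value)); }
            cv.notify_one();
        }
        T receive() {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return !q.empty(); });
            T value = std::move(q.front());
            q.pop();
            return value;
        }
    };

    int main() {
        Channel<int> ch;
        std::thread producer([&] {
            for (int i = 0; i < 5; ++i) ch.send(i * i);
            ch.send(-1);                               // sentinel: no more messages
        });
        for (int v = ch.receive(); v != -1; v = ch.receive())
            std::cout << "received " << v << '\n';
        producer.join();
    }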


Wed, 12 Aug 2009

Concertant's Francis Wray appointed as Visiting Professor at Kingston University

Concertant's Francis Wray has been appointed Visiting Professor in the Faculty of Computing, Information Systems and Mathematics at Kingston University. He has a very strong background in parallel processing, a field that has recently become highly topical with the advent of multicore processors and that is very relevant to the work of the Department.

Concertant has a long tradition of working with a wide variety of institutions. We can apply our unique blend of expertise in areas as diverse as aerodynamic simulations and embedded systems to the benefit of our clients.

For further information please contact: info@concertant.com


Mon, 22 Jun 2009

Between a Rock and a Hard Place

by Russel Winder

There is much speculation lately around the future of the Rock processor, now that (as seems highly likely) Sun will be absorbed into Oracle. Many of Sun's businesses will find a place in the Oracle structure: the storage business, the server business, the OS business and the Java business all have ways of being absorbed into a sensible Oracle strategy for growth. Sun's processor development business, on the other hand, seems out of place in this context.

Analyst and journalist speculation is that the most likely outcome will be that Sun terminates the processor development business prior to the Oracle take-over. The end of the Rock processor will not stall the rise of multicore processors: multicore is now the norm. Rock's 16 cores are now nothing remarkable. Being able to run two threads per core is no longer remarkable. Would the demise of this processor, which has been five years in the making so far, be at all remarkable? Well yes.

Rock was going to support hardware transactional memory. To lose this is indeed something to remark on.

Shared memory concurrency is hard. Shared memory parallelism is even harder. The problem is synchronizing access to storage being used by multiple threads, and programmers generally get it wrong: the tools of locks, semaphores and monitors are all too low-level for the average application programmer. What is needed is a higher level of abstraction. The same happened with memory allocation: explicit allocation of memory by programmers led to unmaintainable programs containing many errors, and lacking in portability, so new abstractions were introduced to make things workable.
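
As a small illustration of just how low-level those tools are (my own example, not drawn from Rock or any particular system): transferring money between two lock-protected accounts means acquiring two mutexes, and doing that by hand in whatever order the arguments happen to arrive invites the classic deadlock; avoiding it requires the programmer to know to reach for something like C++'s std::lock.

    #include <iostream>
    #include <mutex>
    #include <thread>

    struct Account {
        long balance;
        std::mutex m;
    };

    // Locking the two mutexes by hand, in argument order, deadlocks as soon
    // as one thread transfers a->b while another transfers b->a. std::lock
    // acquires both without deadlocking, but the programmer still has to
    // know the trick -- which is exactly the "too low-level" problem.
    void transfer(Account& from, Account& to, long amount) {
        std::lock(from.m, to.m);                                  // deadlock-free acquisition
        std::lock_guard<std::mutex> l1(from.m, std::adopt_lock);
        std::lock_guard<std::mutex> l2(to.m, std::adopt_lock);
        from.balance -= amount;
        to.balance += amount;
    }

    int main() {
        Account a{1000}, b{1000};
        std::thread t1([&] { for (int i = 0; i < 1000; ++i) transfer(a, b, 1); });
        std::thread t2([&] { for (int i = 0; i < 1000; ++i) transfer(b, a, 1); });
        t1.join(); t2.join();
        std::cout << a.balance << ' ' << b.balance << '\n';       // 1000 1000
    }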

Transactional memory is one technique being proposed for ameliorating the problems of synchronization in shared memory parallel systems. Experiments with software transactional memory have been very encouraging; however, being implemented in software, they have some performance issues. Hardware transactional memory would really have been a revolutionary step forward.
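
Rock's transactional interface was never published in final form, so the following standard C++ sketch (my own, and only an analogue) shows the single-word version of the idea: an optimistic read/compute/commit/retry cycle built on compare-and-swap, which transactional memory generalises to arbitrary groups of memory locations.

    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    // The single-word analogue of a transaction: read a snapshot, compute a
    // new value, then attempt to commit with compare-and-swap; if another
    // thread committed first, the CAS fails, the snapshot is refreshed and
    // the update is retried. Transactional memory extends this optimistic
    // cycle to whole groups of reads and writes.
    void add_interest(std::atomic<long>& balance, long rate_percent) {
        long current = balance.load();
        long desired;
        do {
            desired = current + current * rate_percent / 100;
        } while (!balance.compare_exchange_weak(current, desired));  // retry on conflict
    }

    int main() {
        std::atomic<long> balance{1000};
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i)
            threads.emplace_back([&] { add_interest(balance, 10); });
        for (auto& t : threads) t.join();
        std::cout << balance.load() << '\n';   // 1464: all four updates compound
    }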

So if the Rock processor will never be manufactured, will Intel, AMD, IBM, etc. step up and add hardware transactional memory to their chip lines?


Fri, 19 Jun 2009

Carter – a false economy?

by Peter Dzwig

The Carter report on “Digital Britain” calls for “universal connectivity” with a bandwidth of 2 Mbit/s nationwide. Is this a realistic target? Is it a worthwhile target? Probably not.

Looking at the growth in Internet traffic and bandwidth over recent years, that target figure seems absurdly low, and hardly sufficient to make the UK “the global Internet business capital”. As we all know, a promised delivery rate generally falls far short in practice. If the rate were being promised to your front door that would be a great deal better, but the report doesn't say that it is. The £150–200M or so that will be raised by a £6 annual levy on fixed lines to help fund the service won't actually go very far towards delivering a reliable high-speed net. In many other nations, target bandwidths much higher than the UK's would-be goals are already being delivered.

In the UK, perhaps more than in many countries, delivery is very uneven. A quick check on SpeedTest reveals that, in the UK, Ceredigion has the fastest average download speed at 10.2 Mb/s, with upload speeds of 7.7 Mb/s. A glance at relatively affluent, high-population-density commuter areas such as Hampshire and Surrey shows very large areas in which speeds are well below 1 Mb/s in realistic terms; many feature as “notspots”. The reason given by industry for the lack of greater connectivity is economic. In SpeedTest's global lists the UK features 54th by upload speed and 41st by download speed.

There is no doubt that the aim of building a high-performance UK-wide network is well worth achieving, but it is a very substantial undertaking. So why go off at half-cock? The only real solution is a nationwide fibre network capable of carrying the traffic we are going to need in ten years' time, not just today. Such a system has to provide much better speeds for both upload and download. People concentrate on download speeds, but next-generation interactive applications, video conferencing, video streaming, image download for printing and gaming all need speeds to be greater in both directions, and that need will only increase. To deliver this we need to put in a national fibre network, and doing so is going to take more than a couple of hundred million per year. One is therefore forced to ask whether the Carter strategy is right, and whether it is not too modest in its aims, laudable though those may be in the short term.

Emerging technologies, both hardware and software, are behind this, particularly in embedded applications. Applications are being built to take advantage of user demand because they can be realised with the increased processor capability available now and coming over the next few years. That growth in processor capacity has been driven by new technologies, key among them multicore... which is why you are reading this here!