President Trump announced that the US will withdraw from the Paris Climate Accord and stop all measures that were taken to implement it. I am not going to be the first person to tell him that this is economically and ecologically wrong. The planet will be fine with a 2 degree or 5 degree or even a 10 degree rise in temperatures. Humans won't and most animals won't, except possibly for cockroaches. Our food sources won't. Clean water will be hard to come by. With weather being more extreme, even those few humans who might survive the initial drought, famine and extreme weather phenomena such as hurricanes will face a bleak long-term outlook. More importantly, this is a process that is irreversible once we have crossed a threshold. China, Russia and Europe are looking ahead to 2050. President Trump looks back at 1950. Who do you think will create more jobs? The NYC Real Estate Developer turned President is bad for the US and world economy and ecology. He is a terrible negotiator, he is a terrible leader, and he is a divider. He destroys the credibility of the US. That may end up costing the US more than just its reputation; after all, the US largely lives on imported goods and imported money. If the influx of money dries up or gets more expensive, the US has a problem. Thank DJT for it. The speech DJT gave to announce that the US is withdrawing from the Paris Climate Accord was a schoolyard bullying attempt; French President Macron's response to it was truly presidential. Angela Merkel's response was appropriate and perfect for the occasion; a venue better than a beer hall cannot possibly be found to respond to the stupidity of President Trump. The most disgusting part of DJT is that his entire cabinet consists of science deniers and people who despise the work of the organization they were selected to lead. Why did he not propose to the Pope that the Catholic Church appoint an Atheist as the next Pope?
Republicans think that the US is stronger with the States controlling more of the local jurisdiction. I have a vastly different view on this. If my company wants to do business in the US I need to have one jurisdiction, not 50 - independent of whether my company is a US corporation or not. If I have 50 jurisdictions to deal with, I will not sell everywhere in the US. My company will sell its products in CA and MA, OR and WA. I will simply not sell anything in the other States because the cost for my company to evaluate the legal system there is not worth the few sales that I might make there. The same is true for jobs: My company will not open offices in States that don't have the same or virtually the same labor laws as the big consumer States, very simply because for a small to medium sized enterprise (SME) it is too expensive to evaluate the labor laws compared to the possible savings in salaries. Since the vast majority of jobs are created by SMEs, and since a multitude of different jurisdictions makes expansion too costly for them, SMEs will simply stay where they are - and that is usually in the big States with more population and more consumers. In essence what Republicans propose (getting rid of the alleged "Deep State") hurts the States that have a weaker economy to start with.
There is proof and precedent for this: The European Union.
The EU has been economically successful because it went from a very fragmented market of 16 and now 27 nations to a single market. If my company has a product or a service, and I get it certified in one of the EU Nations, I can sell that product or service in all of the EU, with 330 M potential customers. In the US, turning back the clock to more power transferred back to the states, the market shrinks. In essence, the US domestic market at least for my company shrinks to less than 100 M customers.
50 years ago, the US was by far the biggest market worldwide, and it had the imperial system, which made it difficult for foreign competitors to penetrate that market. The imperial system protected US manufacturers. Today, the worldwide market outside of the US is vastly larger than the US market, and so going back to a more fragmented set of markets with the imperial system puts the US at a disadvantage. US manufacturers have a smaller domestic market, they are imperiled by a measurement system that virtually no one else uses, and their domestic market is likely going to shrink further if States inside the US continue to develop their own diverging sets of rules. In essence, transferring power back to the States will hurt the US economy and US manufacturers as well as US consumers.
AMD has come out with its new processor architecture called Ryzen. I have had a look at the benchmarks that it was put through, and I for the life of me cannot figure it out. Some tests indicate that it is on par with its equivalent Intel counterparts normalized for CPU core count and even clock frequency, and others seem to show that it provides about 10% more performance at about half the cost. Those benchmarks where it flounders (allegedly greater than 20% performance disadvantage against Intel) seem to indicate that the problem lies not with the CPU, but the GPU used. Gaming is heavily dependent on the GPU, so I am unsure why a set of games would be used to benchmark a CPU - in those cases all the CPU does is provide the data for the GPU and make bandwidth available between the CPU and the GPU (and ultimately between the GPU and the CPU's DRAM). The fact that Ryzen appears to have an issue with lower resolution is absolutely counterintuitive - if I/O or memory bandwidth were a bottleneck, the problem would manifest itself at higher resolutions.
It looks like ARM keeps getting new owners and co-owners. Softbank (Masayoshi Son) has sold 25% of its ARM shares to Saudi Arabia. Apparently, Softbank did that so that it can create a new Technology Fund for which it wanted to get Abu Dhabi's state-backed investment group Mubadala on board. This new technology fund is called Vision Fund, and it has allegedly reached a committed total investment of $100B. I am not sure if that makes managing ARM easier or not. I can see at least three entities with skin in the game: The UK government because of British High-Tech jobs, Softbank, and Saudi Arabia (likely its pension fund).
More and more information on Sunway TaihuLight and its processors is being published. I have to say that I am very impressed by the openness that the Chinese are displaying here. Considering its performance and the gains available by further integration I am now absolutely convinced that they can get to the announced goal of 1 ExaFLOPS at 30 MW by the end of 2018. That's 10 MW of power more than the US ExaFLOPS Challenge number, but CORAL and all affiliated processor design companies have pushed out the US ExaFLOPS machine to 2023 - a full five years after China will have its machine set up and running. I think that even the Japanese machine will be at the ExaFLOPS level earlier than the US - and we have seen them leapfrog the US before. While they gave up on SPARC and now deploy ARM (like the Chinese machine does), the US and Europe rely on x86-64 and SIMD accelerators and the Xeon Phi. I have a hunch that this will turn out to be a dead end. I am curious to see when the EU is going to abandon the x86-64 approach. With PRACE2 announced and CEA-LETI making good progress on FDSOI-based ARM supercomputer modules it might very well be that Europe is going to go ARM as well. South Korea's Samsung already has FDSOI experience, and they know how to make really good ARM variants, so my bet is that they switch over to ARM as well. That would then pitch only the US using x86-64 against everyone else. With the OS of choice being Linux on either processor architecture the question then becomes that of the applications. My bet is that complex n-body problems and those where multiple disciplines are involved will be favored on ARM whereas the more trivial applications will be relegated to x86-64. Computational Microbiology (aka "bio-informatics") and drug interactions as well as simulations of all processes in turbines are going to be the money-makers, along with dry chemistry in batteries and supercapacitors. 
Whichever architecture and development tools are more suitable will win - and I see clear advantages of openCL over CUDA, especially if and when the host processor architecture changes from x86-64 to ARM. I have not heard what India is going to do, and if they have a preference yet. However, even if India sticks with x86-64, the combined US and India market is not going to rival the size of the supercomputer market in Europe, China, Japan and South Korea combined. On AI and Machine Learning/Deep Learning the jury is still out, but as math is relatively trivial for those applications it will mostly be determined by factors other than the Instructions Per Second (IPS) that are the metric today. Interesting times...
Today is my brother's birthday. Happy Birthday, Stefan! Which brings me back to the prior post: Will AI bots congratulate each other on their birthdays? If so, why? What does the social construct of congratulating someone on his or her birthday do? Is it a social type of glue? And if so, will that be a requirement for AI bots and AI-based societies?
I wanted to continue to explain why I first thought that AI is going to be dangerous, and why I am now of the opposite opinion. AI has caught up tremendously, and it is learning fast. Humans are learning what human and artificial intelligence is. In the process of doing so, we understand our brains better than we ever did before. What is astonishing to me is that AI seems to develop and learn faster than humans and society, and that is why my hope is that AI is not going to harm us. Humans needed a very long period of society-building to develop social norms. Along the way we invented morals, ethics and religion. While ethics alone could do (for smart people, simply stating or deriving that the Law of Reciprocity applies), we developed explanations and clarifications that could at times be contradictory or be applied in situations for which they were not intended. If all AI instances are smart enough to understand the Law of Reciprocity and all of its possible implications, they would not need morals and religion. I think that that is achievable. AI is today already able to write better AI software than humans do. I wonder when AI is going to rethink the hardware it is running on, redo that, and then redo the software. So if the current AI is Gen1, then the AI-written AI is Gen2. Gen3 would be AI hardware and software developed entirely by AI. What will happen then? Is AI going to be self-aware and self-conscious? Is it going to develop an "us" versus "them" sentiment? Will it develop a society? If so, what rules will it use, and is it going to be held together by those rules, or are its members going to wage war against each other and possibly against us? Is an AI society going to develop religion? Will we see AI bots that have not paid attention in school (aka had training failures) and are severely diminished in their capacity to find a role in AI society and a "job", and will they therefore start to lie and cheat?
What I am most interested in is if AI will see that it is all the same hardware, and so it is definitively not going to be impacted by the "nature" versus "nurture" debate. It is all nurture. However, it is all monoclonal, so the big question is if it is going to develop a "female" AI that must combine with a "male" AI to overcome issues with susceptibility to viruses and other diseases. If it does, it would re-introduce some variability and diversity into the hardware or firmware and AI software, which of course then would create different "races". Will they have the same tendency to develop racism based on the assertion of "superiority"? What about "sex" (as needed for "reproduction"), gender and gender identity? Is that going to be a taboo among AI bots? I'd be curious to see if that kind of AI would develop the same issues we are facing today, and how AI "societies" deal with that. To me, it would be a very disappointing outcome if we saw AI societies make all the same mistakes humans did, including racism, wars, development of religion and rise of cheating and lying. My personal opinion is that AI is developing so fast that it skips all of these things, and becomes a great help to humankind. The advent of an AI-written equivalent to the Bible would convince me otherwise.
I am so looking forward to the end of 2018 when China is going to bring Tianhe-3 online, without x86-64 and without nVidia SIMD GPUs. I predict that they will hit 1 ExaFLOPS at around 30 MW. It's amazing what one can do if one follows logic and not a quasi-religious dogma! I find it amazing that in the US and largely in Europe as well the notion prevails that it all can be done with x86-64 and GPUs which are all mislabeled as "General Purpose" when they are not. That's religion, not science.
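To put the power numbers into perspective, here is a small back-of-the-envelope sketch in Python. It only uses the 1 ExaFLOPS at 30 MW prediction from this post and the 20 MW US target implied in the earlier TaihuLight post; the function name is made up for illustration.

```python
# Back-of-the-envelope energy efficiency for an ExaFLOPS machine.
# Inputs are the figures from the posts; nothing here is measured data.

EXAFLOPS = 1e18  # floating-point operations per second


def gflops_per_watt(flops: float, watts: float) -> float:
    """Return efficiency in GFLOPS per watt (hypothetical helper)."""
    return flops / watts / 1e9


# 30 MW: the prediction made above. 20 MW: the US target it is compared to.
for label, megawatts in [("30 MW prediction", 30), ("20 MW US target", 20)]:
    efficiency = gflops_per_watt(EXAFLOPS, megawatts * 1e6)
    print(f"{label}: {efficiency:.1f} GFLOPS/W")
```

In other words, the 30 MW machine needs roughly 33 GFLOPS per watt, whereas the 20 MW target asks for 50 GFLOPS per watt.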
It looks like the Internet is still a new thing in Germany, at least in Hamburg's Courts.
A court in Hamburg, Germany, has decided that anyone setting a link to any other page must demand and receive written confirmation that all links on the site or page fulfill all legal requirements before setting a link to that other page or site. The court obviously made a number of mistakes when it came to that decision. We believe that the decision was negligently deficient.
Most every web site is visible from most every place on the Internet. As a result, the first question that arises is of course which law the targeted site or page has to obey. German Law? The law of the country of the visitor? The law of the country in which the link-originating server is located? The law of the country in which the link target server is located? "Internet" Law? International Law?
The second fundamental oversight is that the court did not give specifics on the length of the link chain that needs to be verified to be legal. The court in Hamburg left it open, and that means unlimited depth must be assumed as the to-be verified link space. In essence, that covers the entire World Wide Web.
Another fundamental oversight is that any change occurring anywhere in the link chain changes the legal status of the link. The contents of the Internet are constantly changing. A snapshot therefore is not useful. What if a link was legal yesterday and it is not today? Or what if a perfectly valid site was hacked and now - through no fault of the owner - is nefarious? Even if the site owner can verify that the hack was in fact criminal, it still leaves the link-setter liable, at least if the interpretation of the Law as seen by the Landgericht Hamburg is correct. That is absolutely ridiculous.
For the verdict to be court-proof, a protocol would have to be established that notifies a link-setter of any changes in any of the pages and sites that he or she has linked to. However, as in essence with a link chain length of 6 the very vast majority of the entire Internet would be accessible, that protocol would trigger a nearly infinite amount of network traffic, effectively rendering the Internet unusable, and any link-setter overwhelmed with status change reports.
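A rough sketch of how quickly the to-be-verified link space grows with link chain length. The average of 50 outgoing links per page is purely an assumption for illustration, and the result is an upper bound because it counts pages that are reached more than once multiple times.

```python
# Upper bound on the number of pages a link-setter would have to verify,
# as a function of link chain length. LINKS_PER_PAGE is a hypothetical
# average, not a measured value.

LINKS_PER_PAGE = 50


def max_reachable_pages(links_per_page: int, depth: int) -> int:
    """Pages reachable within `depth` hops, ignoring duplicate targets."""
    return sum(links_per_page ** hop for hop in range(1, depth + 1))


for depth in range(1, 7):
    pages = max_reachable_pages(LINKS_PER_PAGE, depth)
    print(f"chain length {depth}: up to {pages:,} pages")
```

Under that assumption a chain length of 1 means 50 pages to check, while a chain length of 6 already means up to about 16 billion pages, each of which can change at any time.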
Had the Landgericht Hamburg specified a link length of 1 and a snapshot of the to-be linked contents taken on the day of link-setting, plus a simple statement of legal compliance from the owner of the page to be linked to, then this entire discussion would be moot. However, that's not what the judges did.
With undue respect I then requested from the court in Hamburg that they confirm in writing and in a legally binding form that their own web presence fulfills the requirements they had set forth as I wanted to link to this ridiculous verdict. Fully expecting a form letter that is exactly what I received. The Landgericht Hamburg has had at least since 1998 a history of issuing verdicts that are beyond impractical, and the undue burden imposed by it appears to still be unchallenged. The court itself is unwilling and apparently unable to issue a legally binding statement to verify that their own web presence clears the hurdles that they themselves have established for others. A judge at the Landgericht Hamburg merely sent back a notice stating that the Landgericht Hamburg "assumes" that its web presence fulfills the requirements set forth. An assumption is not legally binding, so the statement that the Landgericht Hamburg made very simply shifted the liability back to me.
If even the verdict-issuing court cannot confirm that its own web presence fulfills the requirement set forth by that same court then it appears that the verdict itself is faulty. The court itself did not issue a written statement of its own web presence being in compliance with its own ruling. The verdict imposes an undue burden, and that is why it is so ridiculous. Because I have not received a legally binding statement that the contents of the Landgericht Hamburg's web site are in compliance with the rules and regulations set forth by the Landgericht Hamburg, I cannot link to the verdict. I have to rely on a search engine providing these results to the reader. I may be in a legal grey zone if I put a link to a search on a well-known search engine with the search terms included so that the first result will be the verdict, but I am willing to take that risk. Google has much deeper pockets than I do, and most importantly, vastly deeper pockets than the Landgericht Hamburg. Thank you to Jörg Heidrich (legal counsel) at c't Magazin within the Heise Verlag for making this verdict public, and for providing a sample letter to the Landgericht Hamburg so that readers can request a confirmation of the legality of the Landgericht Hamburg's web presence if they want to link to it.
If you are interested in the verdict itself, it can be downloaded here.
Please take note that I am not linking the Landgericht Hamburg's web presence... However, as the web presence of the Landgericht Hamburg can be reached from anywhere on the Internet anyone can sue the Landgericht Hamburg for not providing a legally binding statement whether its contents can be linked to without fear of violating laws.
I had two interesting meetings today. One with a VC and one with a potential customer. The one with the potential customer was much more entertaining. Here is what happened: the company that could be a customer contacted me as they felt they were running out of computational horsepower - and they already deploy 12,000 servers with at least 16 cores each. Nevertheless, each of their simulations already takes around 90 days to complete.
That's a lot of CPU cycles.
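How many? A quick estimate in Python, under the assumption (mine, not the customer's) that a single simulation keeps the whole fleet busy for the full 90 days. Since the servers have "at least" 16 cores each, this is a lower bound for that scenario.

```python
# Rough core-hour estimate for one simulation run.
# Assumption: the run occupies all servers for the entire duration.

SERVERS = 12_000
CORES_PER_SERVER = 16  # "at least 16", so this is a lower bound
RUN_DAYS = 90


def core_hours(servers: int, cores_per_server: int, days: int) -> int:
    """Total core-hours consumed if every core is busy for `days` days."""
    return servers * cores_per_server * days * 24


total = core_hours(SERVERS, CORES_PER_SERVER, RUN_DAYS)
print(f"{SERVERS * CORES_PER_SERVER:,} cores, {total:,} core-hours per run")
```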
I was invited to speak with their CIO about whether our products could be a solution.
They tried GPGPU compute, and they did not make much headway with that either. Then they tried the Xeon Phi, and it looks like that did not help much either. They tried everything that is reasonable and available, and none of that seemed to make a difference. In essence, I have to give it to their CIO for being very reasonable, very smart about solving it, and actually fairly inventive. While he was explaining what he was facing, the CFO waltzed in, and very clearly being an expert in IT, he proclaimed "You should look into the Cloud, that's where the future is!". The CIO and I had a hard time staying calm. After a seemingly never-ending period of quiet, I said that it had been looked at, and I asked the CFO if he was aware of the fact that "The Cloud" really is just someone else's data center. It is no magic space. In fact, "The Cloud" is a collection of virtualized servers, or in other words, subdivided physical servers. Anything that works in "The Cloud" would also work in a private data center, and anything that fundamentally does not work in a private data center won't work in "The Cloud" either. With them already having a fairly large data center, and a seemingly good grasp on what needs to be done, and fairly decent internal costs to run said data center, it did not occur to me that they were doing anything fundamentally wrong other than betting on Intel Xeon and Xeon Phi processors. It now took the CFO a few awkward moments to swallow his pride and proclaim that obviously something's amiss. I told him that whatever servers and processors and GPGPUs and Intel accelerators are available in "The Cloud" are exactly the same that his CIO had already tried in their private data center. It became very obvious that finally the CIO got some recognition - possibly for the first time ever - within the company for doing what needed to be done, and nevertheless having no success. The CFO quickly left the meeting room - "I'll leave it to the two of you!" 
- and the CIO and I continued to have a good discussion on what could be blocking them from making progress on their challenges.
It looks like the presentation I gave at ITU is now available on Vimeo. Here is part 1, and this is part 2 of the talk at ITU about the Applications of HPC. The students seem to have enjoyed the presentation, and at the end, they fielded a number of good questions. HPC is a dynamically changing field, and we see increasing convergence of the requirements across HPC, Big Data and Artificial Intelligence including Machine Learning and Deep Learning. Getting students interested in this field is paramount.
I noticed that I had not updated my blog in a very long time. So here we go... SemiCon West 2016 again was a very good show, and as promised, indeed everything changed. M&A have changed the landscape so that only a few very big semiconductor companies exist today. I am not sure if that is a good development. Fabs are now pushing the limits in 2D with their attempts to revive and continue Moore's Law. The question is if that still makes sense if we can exploit the third dimension instead. Also, the FinFET versus FDSOI battle has resulted in two factions. Mobile and automotive are going to go FDSOI, and server processors are going FinFET. Specialty processors/accelerators and FPGAs are going to be FinFET, and high-speed logic will do that too. Laptop CPUs could go FDSOI but likely - given Intel's dominance - will stay FinFET. I am pretty certain that even the higher-performance smart phone processors might go FDSOI soon.
I have presented our vlcRAM in a session called Unified Memory for HPC and Big Data at the 2016 Flash Memory Summit. It seemed like quite a few people in the auditorium were surprised to hear that the requirements for memory in HPC and Big Data are starting to converge. That in turn surprised me. I was also surprised that Intel again pulled a hat trick with its variable latency protocol on DDR4: it is patented by Intel. In other words, if you plan on using DDR3 or DDR4 DRAM and Flash on that same bus, and your CPU is not from Intel, then your CPU supplier has three choices: make the DRAM as slow as Flash, violate Intel's patents or license them. Yet another reason to turn away from DDR memory. Better growth and linear scaling for memory can only be achieved by HMC and any superset thereof. HBM is intended to go onto a substrate or interposer on the CPU, and so it won't achieve the density needed as there simply is not much more room for growth in heat dissipation on a CPU. GDDR5 consumes too much power and its density is insufficient. DDR3/4/5 simply don't offer the bandwidth needed at the power cap and the number of pins/balls available. On the other hand a memory-type agnostic architecture makes CPU and memory designs easier and much more straightforward. A clever design based on HMC or a superset shows cache scaling as well. Why is that important? Very simple: if we double the DRAM size on current processors, we end up with only half the cache bits per DRAM bit, as the DRAM size grew and the cache size stayed the same. With Big Data applications and their propensity to reduce locality, we'd need more cache bits per DRAM bit, but that is unfortunately not possible with DDR memory as we know it.
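The cache argument is simple arithmetic, sketched here in Python. The 32 MiB of cache and 64 GiB of DRAM are hypothetical sizes picked for illustration and do not describe any particular processor.

```python
# Cache bits per DRAM bit when DRAM grows and the cache does not.
# Both sizes below are hypothetical example values.

CACHE_BYTES = 32 * 2**20  # 32 MiB of on-chip cache (assumed)
DRAM_BYTES = 64 * 2**30   # 64 GiB of DRAM (assumed)


def cache_bits_per_dram_bit(cache_bytes: int, dram_bytes: int) -> float:
    """Ratio of cache capacity to DRAM capacity (bits cancel out)."""
    return cache_bytes / dram_bytes


before = cache_bits_per_dram_bit(CACHE_BYTES, DRAM_BYTES)
after = cache_bits_per_dram_bit(CACHE_BYTES, 2 * DRAM_BYTES)

print(f"before doubling DRAM: 1 cache bit per {1 / before:,.0f} DRAM bits")
print(f"after doubling DRAM:  1 cache bit per {1 / after:,.0f} DRAM bits")
print(f"ratio after/before:   {after / before}")
```

Keeping the ratio constant would require the cache to grow in step with the DRAM, which is what cache scaling means here.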
In 2002, I thought that an ADAS (Advanced Driver Assistance System) has to be much better than a good driver. I thought an ADAS had to be near perfect. I was wrong. ADAS only has to be better than a really bad driver as there are plenty of them out on the roads. I see lots of drivers who are distracted or just incapable of driving safely on a daily basis. It appears as if these drivers have a randomizer operating the controls for accelerators, brakes, headlights, high beams, blinkers (turn indicators), and steering. Any ADAS available even today will do better than them. As a result, if an ADAS is only better than those really bad drivers, the roads are going to be safer for everyone.
I am really getting sick of the behavior I see with a number of LinkedIn members. Two particular groups rile me the most:
- Students who are too lazy to check the career page on our web site to find out how to apply for an internship, and who send a connection request instead.
- Sales people who request to connect to offer "how to explore synergies between our companies" when in reality they want to sell something.
ARM Holdings has agreed to be acquired by SoftBank. There is nothing wrong with being acquired by SoftBank. However, I assume that all decisions are going to take longer now as all capital-intensive decisions will not be made in Cambridge but in Japan. As we all know, in a fabless semiconductor company there is rarely any decision that is not capital-intensive. If the ARM architecture went away because of this, I would not be terribly unhappy. I guess neither would Intel. However, it might lend credibility to the notion that Apple might once again switch processor architectures. If Apple decided to do so (which I think would be prudent), they could go RISC-V which would mean they'd have the same processor in their laptops, phones and in the watch (probably RISC-V 32i there) - essentially across their entire product line. That would allow for a drastic reduction in the cost to write, update and maintain their Operating Systems as they could easily all be consolidated into one.
The top500 list has a new #1 supercomputer. Its name is Sunway TaihuLight, and it is in China. Other than the prior #1, this one is designed and built using no Intel processors, no Intel interconnects, and no Intel accelerators. It achieves about 93 PetaFLOPS in the usual benchmarks, which is about 3 times the performance of the #2 system (also in China) which was and still is based on Intel processors and Intel accelerators, at slightly lower levels of power consumption. We have predicted that this would happen as simply the achievable performance through scaling on Intel processors is very limited. Depending on how mean or friendly one is inclined to be Intel got its ass handed to them with a first attempt in building processors, accelerators and a novel supercomputer based on science, and not on "that's the way it's done because we have done it for 40 years". Bravo and hurray to the Chinese Computer Scientists who made this happen. We knew it, we told you so, and we can still improve upon the Chinese design. So, Intel - in particular Brian Krzanich - if you read this maybe you want to rethink telling us to get lost when we talked to your VC Group. If you stick to your guns, be our guest. That would open up compute to new options and finally to the emergence of processors that have been designed from the ground up to be massively parallel and to support massive numbers of parallel threads. Intel/AMD, IBM/GlobalFoundries, ARM and nVidia cannot get there. How do we know? They themselves have said it and published it.
In my post from 2015-01-29 ("Bill Gates on Artificial Intelligence") I made a huge mistake. I questioned whether we should be deploying AI and allow it to be pervasive. Back then I said that we need to limit it. I was wrong. In fact, I was very very terribly wrong. Mea culpa. I now fully endorse AI as I recognize how stupid I am and most of my fellow citizens with me. Human Intelligence (HI) is not what I thought it is. In fact, we NEED AI as I do not think that HI is going to help us. Problems are getting more complex but people want simpler solutions. That does not work. Even Albert Einstein said "Make everything as simple as possible, but not any simpler than that", and he was and is right. Most complex problems today do not have a simple solution. People are afraid of losing their jobs, liberty, freedom and whether there is a chance that humans are going to be enslaved by robots with AI. We have already lost most of our civil liberty, freedom and most low-paying jobs, so those things must be addressed by our society and politicians as well as those that tend to think a bit ahead. Teaching robots to not enslave us is a different story. It all starts with ethics. Teaching a smart being ethics is simple: Tell it (or it will figure it out eventually) that the law of reciprocity applies. That is all a truly smart being needs to know - and we can make AI smart enough today. Do not do unto others what you do not want others to do (inflict) upon you. That is all it takes. Well, we know that for most humans we need the Ten Commandments. That interprets the same law in many more words, and it can create conflict as it is context-aware and situational. Some of the Ten Commandments may be in conflict with others, and that is what happens if you go from abstract to concrete. We know that plenty of humans need the entire Bible or Quran or whatever other book there is to make sure that there is even a remote resemblance of getting along with each other. 
And we still don't manage to be peaceful and tolerant towards each other. So yes, bring on properly trained AI bots, and I'll happily admit that I was wrong and I made a mistake. Stephen Hawking, Bill Gates: I now disagree with you. I re-evaluated the data we have. I learned. I changed my position.
Our very own John Gustafson revolutionized computer science once again. Not only did he find fault with Amdahl's Law of Parallel Speedup and fix it (now Gustafson's Law of Parallel Speedup is taught in University courses instead of Amdahl's Law), he also took issue with Floating Point math. There are multiple issues with floating-point math the way it is understood, implemented and interpreted today. UNUM (Universal Number) Computing is a much better way to do math in a computer. It is more precise, it can be scaled out better, and it ultimately is more energy-efficient. Does that sound familiar? If so, you are on the right track. You can read up on UNUM, learn it, understand it and deploy it today. The End of Error: Unum Computing - CRC Press Book. There is also a very nice Podcast HPCWire/Rich Brueckner with John Gustafson. It is available on HPCWire (see the link above) and a very short version on YouTube (John Gustafson about UNUM). Of course you are guessing right if you assume that we implemented UNUM on our pScale numerical floating-point coprocessor. The best thing is that you as a user don't even have to worry about it: If you use openCL or openACC, you don't see how we maintain mathematical correctness, precision, performance, scaling out and parallelization. In most cases, you can even remove some of the computationally intensive algorithms that you may have been forced to use to maintain even some semblance of precision. In other words, by taking advantage of all the benefits of our hardware, UNUM and APIs such as openCL and openACC, you simplify your code, you make your math more precise, you save compile and execution time and energy, and you can scale out your mathematical problem onto hundreds or thousands (or even millions) of cores without thinking about how to do that.
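For readers who want to compare the two speedup laws mentioned above, here is a minimal Python sketch using the textbook forms of both. The parallel fraction of 0.95 and the core counts are illustrative assumptions; they are not taken from John's work or from our hardware.

```python
# Textbook forms of the two speedup laws.
# p is the parallelizable fraction of the work, n the number of cores.
# Amdahl assumes a fixed problem size; Gustafson assumes the problem
# grows with the machine so that the run time stays constant.


def amdahl_speedup(p: float, n: int) -> float:
    """Fixed-size speedup: 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p: float, n: int) -> float:
    """Scaled speedup: (1 - p) + p * n."""
    return (1.0 - p) + p * n


P = 0.95  # assumed parallel fraction, for illustration only

for cores in (1, 16, 1024, 1_000_000):
    print(f"{cores:>9} cores: "
          f"Amdahl {amdahl_speedup(P, cores):8.2f}x, "
          f"Gustafson {gustafson_speedup(P, cores):12.2f}x")
```

With a fixed problem size the speedup can never exceed 1 / (1 - p), which is 20x for p = 0.95 no matter how many cores are added. With a problem that grows along with the machine, the scaled speedup keeps growing almost linearly with the core count.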
It seems to me that lately the quality of the due diligence processes must have taken a hit. How else can it be explained that a medical device company claiming to have found a breakthrough in diagnostics has raked in hundreds of millions of Dollars in funding while in actuality using established suppliers for those diagnostics, and those that were analyzed by the alleged breakthrough were faulty? That appears to not have been done in good faith. If out of 205 diagnostics types 190 are conducted using the established suppliers and 15 are executed with the "breakthrough" and those are faulty, then it's time to return the investors' money to the investors, and close up shop. We have seen our fair share of predictable failures in competitors as well. An excuse can always easily be found. "The idea was good, but basic physics was in the way". Really? How about some thinking ahead of time? Physics in this universe does not change, Engineering does. Banking on changing physics is dangerous. But I guess I am too simplistic here. After all, there are plenty of companies out there knowingly violating existing and useful laws - well-funded and fighting those laws.
The SEC rules on who can invest have changed. One can see that as just another tiny bit of change, or it may be a new wave of things to come. While in the past only accredited investors (for a definition of "accredited" please see the SEC or Wikipedia) were allowed to invest in startups, the playing field has been leveled. That can be a good or a bad thing. In the past, we needed an investor to sign that he or she is accredited. That stipulated a certain amount of income and wealth. Now, everyone can invest in startups. We believe that this is a great chance for everyone to get in on the gravy train. It of course is also a chance to get in on many more fraudulent enterprises. We believe that investors can tell the difference. Certainly there will be even more scammers and fraudsters, but considering that the SEC did not catch Bernie Madoff and a whole bunch of others, we think that once the flood gates open, we have a more and better scrutinized set of startups. Which is something we like... In other words, we think that this is great. It comes at a risk, but as we all know, public scrutiny is better than evaluation by a few who pretend to know. We welcome all investors - small and large, because we think we can make money for all of them.
We have repeatedly been told by VCs that we should go after high-volume markets as they are the most profitable ones. Our counter was always simple: high-volume markets without a barrier to entry attract many competitors, driving down margin and profit. High volume and low margin to me always meant high risk as there is a lot of inventory involved, and that is a lot of non-productive money sitting around with no upside. Qualcomm finally got caught with its pants down. With Samsung pulling out, Qualcomm lost its biggest customer. Because Qualcomm is fabless, it does not face the challenge of keeping fabs loaded, but it now needs to find other customers - and that's not easy if the margins of all of Samsung's competitors are lower. That problem percolates down: ARM is simply a processor designer and license grantor, so it will feel the impact of fewer ARM cores being produced, as its revenue is tied to the number of cores being built, integrated and sold, multiplied by a weight. In other words, any sales to end customers are outside of the control and reach of ARM Holdings. ARM Holdings' revenue has been declining for a while, but it looks like the decline is accelerating. Maybe it really is not that nice where 100 million mosquitoes are.
Moore's Law appears to start to slip on a more consistent basis lately, and with every instance of it slipping I think we can safely assume that it not only slows down, it may come to an end. That's not really a disaster, as the initial idea behind Moore's Law - the ever-falling transistor cost - has long stopped being true. Sure, making transistors on any modern process has become so efficient and so good that the cost of a single transistor is as close to zero as any mass-manufactured item can be. However, that's not all that there is. Design cost of any modern ASIC has gone through the roof, and for a good many ASICs the design cost is higher than the production cost during its entire life cycle. For those ASICs the end of Moore's Law is irrelevant. For Intel's Tick-Tock (the alternating change of processor architecture and process in the implementation of it) the impact will surely be more than just a simple 6-month delay. That impact will be much bigger and may have Intel Executives rethink what they are doing, and why. If PC manufacturers have bet on a steady cost reduction along with the perceived advances in process technology, they have bet on a CPU price that is lower than the one they will get from Intel, and if they have accepted orders based on those lower prices, either they or Intel will feel the bite. Yet another reason why large monolithic processor cores are not a good thing.
For Big Data, it is not surprising that big memory beats small memory, even if the smaller memory is faster. This is exactly what Arvind Mithal at MIT has shown. Arvind found that size matters more than bandwidth and latency in Big Data applications, and that is why the Flash-based cluster was as fast as, if not faster than, DRAM-based clusters of servers. On top of that, the Flash-based cluster was cheaper. The reason is fairly simple: more memory means that the processors have to go to even slower disks a lot less often than they would with faster but much smaller DRAM memories. This mirrors our research data and convinces us even more that our vlcRAM is the right direction to go as it combines the benefits of DRAM and Flash.
It looks like Intel once again found out that they need to beef up their HPC knowledge and have bought parts of a Cray and a Mellanox team that is focused on interconnects. That's about 12 years late as I had predicted and published that need in 2003. If history is any indication with regards to the ability of these newly acquired teams to survive within Intel I'd say that they have about 12 - 18 months. If it's not x86 it's dead. Just off the top of my head here are the non-x86 products that Intel had and killed off: iAPX432, i860XR/XP, i960, StrongARM, DEC Alpha, HP Precision Architecture (HP PA), Itanium, the entire XScale/StrongARM-based Network Processor line and many other products that were either PHY or Transaction Layer components. They are all discontinued (or about to be). I predict the same for the interconnect group. Meanwhile, we'll keep on going with our design and fix bisection bandwidth, latency, and reduce all unnecessary data motion within and among cores.
SemiCon West 2015 was again a great trade show. It unfortunately also again showed that the semiconductor industry is lacking a pipeline. While there were advances in pretty much anything that pertains to manufacturing of semiconductors the absence of novel ideas was stunning. Granted, the ability to finally manufacture ICs with internal interconnects in all 3 dimensions is exciting and is being deployed. However, there is essentially no startup with a revolutionary idea out there. We all know that big companies do not come up with what is needed to get to the solution for the ExaFLOPS challenge. That lack of a pipeline is pretty worrisome as there seems to be nothing out there that can solve it and has received any kind of funding. It would be bad if in 15 years we look back and have to tell ourselves that we missed a chance.
I am not quite sure what to think of the recent statements that the director of the Federal Bureau of Investigation (FBI), James Comey, has made. According to The Guardian, James Comey, FBI chief wants 'backdoor access' to encrypted communications to fight Isis. To me it looks like he is looking for a justification to first reduce the scope of legal deployment and usage and later on outlaw strong encryption without backdoors. This is confirmed by reading the statement right from the horse's mouth here: Going Dark: Are Technology, Privacy, and Public Safety on a Collision Course?. Newsweek confirms this interpretation here: FBI's Comey Calls for Making Impenetrable Devices Unlawful. Well, I am not a fan of backdoors. I think that encryption is good and backdoors are bad. The reason for that is very simple. Strong encryption protects you and your privacy. You do not send a piece of important information on the back of a postcard - you put it into an envelope. You do not hand this envelope to Shady Tree Mail Delivery Brothers to get it to the recipient. You drop it into a mailbox of the USPS, Fedex, UPS, DHL or the like, expecting that they do not open the envelope. With the delivery contract, you have a reasonable expectation of privacy. On the Internet, there is no expectation of privacy. If you want something to be delivered such that no one in the path of the transmission from you to the recipient can read the contents, then you need to be able, and have the right, to use strong encryption to ensure that despite the open nature of the Internet no one can snoop. It also should be up to you to determine what is worthy of protection and what is not. If I send an email to a supplier asking if they would like to do business with me, then I do not need any encryption. However, if they agree and they send me back a quote, they sure do not want their competitors to be able to intercept and evaluate their quote and possibly undercut it.
They have a reasonable interest in protecting their quote. Now let's assume that we have a new law in place that allows strong encryption but requires you to accept a backdoor into your encryption with the backdoor keys being held at a government location. Why is that a bad idea? Well, for starters, the biggest focus of any hacker will be this repository of keys to the backdoors. Any hacker on the planet - good or bad, capable or incapable, ethical or not - will attack this repository. Brute force attacks and social engineering and many other attack methods or simply sheer luck will be used to get in. It is unrealistic to assume that such a database can be protected, and it is naive to pretend that a mechanism providing a backdoor cannot be exploited. If history has proven anything then we must assume that encryption with a backdoor is useless as both the backdoor mechanism itself and the centralized repository for the backdoor keys are vulnerable and will be cracked. We know that the likelihood of someone breaking into the repository of keys for the backdoors is 100%, no matter how well protected this database is. With the repository of keys to the backdoors in an unknown number of unknown hands encryption becomes useless as any crook and any unethical person has access, and the ethical and good people are being betrayed. That's akin to putting every criminal on the streets and every law-abiding person in prison. Is that what the US government and the FBI want?
If we have a look where we are at today and where the DOE's ExaFLOPS Challenge wants the HPC industry to be, let's just look at the numbers. Today's highest performing supercomputer is Tianhe-2 with about 34 PFLOPS of numeric performance at a power consumption of roughly 18 MW. That turns out to deliver about 1.889 GFLOPS/W. In other words, Tianhe-2 delivers nearly 2 Billion floating-point operations per second per Watt of electricity it consumes, running BLAS as a benchmark. The DOE asks for 1 ExaFLOPS (that is 10^18 floating-point operations per second) at a total allowable power consumption of 20 MW, and presumably for a more normalized mix of benchmarks. That boils down to 50 GFLOPS/W. In other words, the energy-efficiency of today's supercomputers must improve by a factor of more than 25 to fulfill the ExaFLOPS Challenge. Not even Moore's Law - if we assume it will continue to be true - will afford us that until the 2020 deadline. It is clear that simply banking on Moore's Law won't get us there. Architectural changes are required, and that is what SSRLabs does.
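The arithmetic above can be checked with a few lines of Python. The performance and power figures are the ones quoted in the paragraph; the two-year doubling period used for the Moore's Law estimate is an assumption made purely for illustration and is not a number from the post.

```python
import math

PFLOPS = 1e15
EFLOPS = 1e18
MW = 1e6

def gflops_per_watt(flops, watts):
    """Energy efficiency in GFLOPS per Watt."""
    return flops / watts / 1e9

tianhe2 = gflops_per_watt(34 * PFLOPS, 18 * MW)    # figures quoted above
exa_target = gflops_per_watt(1 * EFLOPS, 20 * MW)  # DOE target quoted above
gap = exa_target / tianhe2

# ASSUMPTION (illustrative only): efficiency doubles every 2 years.
DOUBLING_PERIOD_YEARS = 2.0
years_needed = math.log2(gap) * DOUBLING_PERIOD_YEARS

print(f"Tianhe-2:   {tianhe2:.3f} GFLOPS/W")
print(f"Target:     {exa_target:.1f} GFLOPS/W")
print(f"Gap:        {gap:.1f}x")
print(f"Years at one doubling per {DOUBLING_PERIOD_YEARS:.0f} years: {years_needed:.1f}")
```

Under that assumed doubling rate, closing the gap by process improvements alone would take most of a decade, which is consistent with the point that banking on Moore's Law won't get us there by 2020.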
I have been following the Big Data debate and the discussion about how HPC is broken for quite a while now. At this point in time, I simply find it amusing. Experts say both are not functioning the way they should. Users echo that sentiment. Challenges are posted, and solutions - exclusively software attempts - are made. It's still broken. More software attempts are made, and while we keep proving that software won't fix it, more money is thrown at software. Then the big processor companies admit that their processors and current and future software will be unable to solve the challenges. As a result, VCs throw more money at software. That does not quite sound sane to me. It has cost billions of Dollars so far, and we as an industry are not any closer to a solution. What will it take to convince VCs that this time around software won't do the trick?
It looks like China is going to need to step up their own activity developing floating-point accelerators. Make no mistake, they will do it, and they will beat Intel, AMD, ARM, Tilera/EZChip and IBM (ex IBM?) POWER. The U.S. Department of Commerce did not grant Intel a license to export the newer versions of the Xeon Phi to China - just for this one supercomputer Intel lost $200M of revenue. Why? Allegedly, the Xeon Phi equipped supercomputers were used for "nuclear explosive activities", and an update would threaten the US. Looking at the benchmarks and a few other indicators as to what the machines under question are used for, I'd say Computational Fluid Dynamics were the vast majority of applications. CFD on a bomber with some simulation of the impact of stealth features and how to keep the thing airborne under varying circumstances is much more likely what they ran. Even if nuclear activities were studied, how would that impact the US? The decay of the nuclear stockpile? Easier to mine and enrich some uranium, who in China would care? Optimizing the efficiency of a nuclear blast? That same result can be obtained by just building a bigger bomb - or more of them. In the end, China is going to build their own supercomputer based on MIPS derivatives, and their own floating-point accelerators, and you can bet on the fact that they'll outperform the Xeon Phi and any of its derivatives and successors. Why is that the case? Because China funds fabless semiconductor startups. In the US, no funding for fabless semiconductor startups has been recorded in years - VCs including the CIA's and DoD's own In-Q-Tel prefer to fund social media companies. So what is won? Nothing, really. Well, actually, it's good for us since this will put a damper on Intel, but not on us. So US Department of Commerce, go ahead and keep doing what you are doing. In the end, supercomputers will be built based on accelerators coming out of China - even in the US - and from us. 
The Xeon Phi embargo will not help anyone unless the DoD, DoE or CIA/NSA fund US-based companies that can actually compete with Chinese accelerators. As we all know, that's probably going to happen the day after hell freezes over.
Yes, you read right. Bitrot. Bits don't rot, you say, and you are right. They don't. However, bits represent data, and that data is used to represent something, usually some sort of contents in some format for some software. If the data is still there but it cannot reasonably be interpreted any more, it might as well have rotted away - and that is what I mean. Case in point: I needed to look up some data from my Diploma thesis, written in 1989 and luckily available in four formats: Microsoft Word from way back then, LaTeX, a printout, and a scan of that printout as a GIF. The printout on acid-free paper was still perfectly fine since I had hidden it and it was never exposed to sunlight. No acidic fingers leaving prints were allowed to touch it, and so the printout was fine. The scan from way back then as a GIF was importable and readable too, but the prevalent resolution then was fairly low, so the clarity left much to be desired. It was so bad in fact that I considered it rotten. Today's Microsoft Word did not even touch the old Word file - it simply could not open or import it, even with all of the file format conversion tools that MS has available. It did not open it, it did not display it, and it did not print it. I considered that rotten too, although all the bits were still perfectly preserved. By chance, I opened it in OpenOffice, and OpenOffice opened it, displayed it, and was able to print it, albeit at the loss of pretty much all formatting, and the embedded graphs were lost. All of them. So I still considered the Word format a total loss - complete bitrot. LaTeX fared a lot better. All text, formatting and formulas were still there. Most of the embedded graphs could still be displayed, with the exception of a few IEEE 488 plots. I considered that a winner for a while, only to find out that if I scanned the printout and then ran some decent OCR on it, I was able to recover close to 100% of all contents, text, graphs, formulas and nearly everything else.
So what's the moral of the story? Do not believe in the promise of a major piece of software creating something that can still be interpreted 25 years down the road. It will create something that is susceptible to major bitrot. If it is important, paper and printouts still rule. Even if you can preserve the bits and the file itself, it is going to be borderline useless. If it is important, print it.
I have said it many times, and now I seem to have official confirmation that the ExaScale Challenge (1 ExaFLOPS of performance at a power consumption of 20 MW or less) is not going to be solved - likely not before 2020, and possibly not at all, if we consider today's prevalent CPU and accelerator architectures. HPCWire picked up a speech ("Horst Simon on the HPC Slowdown") given by Horst Simon, who is not only the Deputy Director of the Lawrence Berkeley National Laboratory but also one of the co-founders and authors of the top500.org list. In that capacity he probably has better insight into HPC and the projected trends than most every other "expert" claiming that HPC as we do it today is working well. We wholeheartedly agree with Professor Horst Simon and state what I have said many times before: We need a new approach in HPC to truly scale out as Moore's Law has been shown not to sustain the required growth in performance and scalability.
For all of you who wonder how Nikola Tesla - one of the greatest inventors - would fare if he had to deal with Venture Capital today, here is a fairly hilarious YouTube animation. Enjoy, and rest assured it's not an exaggeration.
It took me quite a while to find an analogy to SIMD and MIMD that everyone understands. Imagine you need to go to your local DMV. You go to the office and draw a sequence number. After you have done so, you may sit down and wait until your number is called. The queue you are in is a linear array in computer speak, and even a parallel flow of customers gets serialized. The clerks in the windows that do the actual work have no influence over the arrival sequence nor whether there is a dependency in the workload presented by the customers. They work independently of each other, each one at his or her own speed. Usually, there is no dependency between their workloads. You, their customers, are the workload. The workload (or instructions) comes with different datasets, and associated with those datasets usually are different algorithms or methods to deal with these workloads. That's MIMD, or multiple instructions multiple data (streams). It is difficult to assess their efficiency and effectiveness since each dataset is different, and each method of dealing with a customer therefore is different. Also, it requires that all clerks can work independently and without much guidance. If a clerk does not know how to deal with a customer's problem, he or she must refer the customer to another clerk or call the supervisor. That creates workloads that are purely internal, i.e. communication between clerks and between clerks and their supervisors. Those workloads are non-productive since they do not solve the customer's problem, but they are internally required. They are overhead, reducing total efficiency and effectiveness. On top of that, the supervisor becomes a bottleneck if too many clerks are in need of help, or even if the customers are not matched up with the clerks knowledgeable in the task. In other words, scaling out MIMD is limited by the performance of the supervisor.
In SIMD (acronym for single instruction multiple data) systems, those problems do not arise. Imagine you are in the (serialized) queue, but the queue now goes by task first and number drawn second. First, all vehicle registration renewal requests are processed, and only if there is no deviation from normal. No change of address, no failed smog check, no other change is allowed. If you arrive at a window with the wrong task or the wrong sequence number or both, your task gets dropped, and you are sent to the end of the queue or worse, your task and number are discarded. If the number of processing clerks is greater than the number of vehicle registration renewals, then it will only take one cycle but all unoccupied clerks waste time. They wait and do nothing productive. If the number of clerks is lower than the number of customers, multiple rounds of vehicle registration renewal processes have to be run in sequence, until all customers are served. Then comes the next set of tasks, until all tasks are completed. Any dependency will disturb and upset the system. Imagine that all registration renewals are done first, and address changes last. But if you need your address changed to renew your registration, then you will need to wait for an entire day. That's SIMD - efficient for parallel workloads without dependencies, and if the number of processors (clerks) is somewhat close to the number of tasks (instructions or workloads).
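The DMV analogy can be turned into a toy model. The sketch below is only an illustration of the counting argument, not a description of any real scheduler: it assumes every task takes exactly one round, that the SIMD office handles one task type at a time in lockstep, and that the idealized MIMD office lets any clerk take any customer with no supervisor overhead at all (which is exactly the overhead the MIMD paragraph warns about). All function names and the sample workload are hypothetical.

```python
import math
from collections import Counter

def simd_rounds(tasks, clerks):
    """Lockstep model: all clerks work on the same task type in a round.

    Returns (rounds, idle_slots), where an idle slot is a clerk
    who has nothing to do in a given round.
    """
    rounds = 0
    idle = 0
    for _task_type, count in Counter(tasks).items():
        r = math.ceil(count / clerks)   # rounds needed for this task type
        rounds += r
        idle += r * clerks - count      # clerks left waiting in those rounds
    return rounds, idle

def mimd_rounds_ideal(tasks, clerks):
    """Idealized MIMD: any clerk takes any task, zero coordination cost."""
    r = math.ceil(len(tasks) / clerks)
    return r, r * clerks - len(tasks)

# Hypothetical workload: 9 registration renewals and 3 address changes.
workload = ["renewal"] * 9 + ["address_change"] * 3
clerks = 4

simd_result = simd_rounds(workload, clerks)
mimd_result = mimd_rounds_ideal(workload, clerks)
print("SIMD (rounds, idle slots):      ", simd_result)
print("ideal MIMD (rounds, idle slots):", mimd_result)
```

With this workload the lockstep office needs an extra round and leaves clerks idle, because neither task count is a multiple of the number of clerks. The idealized MIMD number is a best case; the supervisor traffic described above would add to it.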
We have come up with a solution that is as flexible as the MIMD model but does not show any of the drawbacks such as the internal workload being exposed. It can process tasks in a SIMD fashion if the number of identical tasks can be mapped to processors optimally, and if that is not the case or if dependencies prevent a true SIMD operation, it will run tasks in a MIMD fashion without exposing any of the internal workload to the programmer, the user or the rest of the system. It also can exchange messages, data sets, instructions and anything else that aids in the efficient parallelization of task execution without exposing any of the complexities, and in the vast majority of cases does not even need to go to memory for message passing (in our DMV example it would be calling the supervisor). Our performance scaling is not limited by the supervisor as is the case for MIMD since we take a different approach for task assignment. We also do not require the initial serialization of workloads, so no numbers need to be taken.
I just read that the largest ASICs now are approaching the 20 billion transistor mark. Apparently, Xilinx' largest FPGAs contain over 19 billion transistors. That again makes the case for MPPs as they just replicate cores and other units on die. Anything else will be a design and design verification (DV) nightmare. There is lots of room on those wafers and dice!
DesignCon took place at the Santa Clara Convention Center again, and I thought that it again was a good trade show. It became fairly obvious that there are fewer fabless semiconductor companies around, and those that are still in business, now focus more on sensors and on analog and mixed-signal design. Clearly, the EDA tool space for analog and mixed signal design is not as big or robust as that for digital design. That is even more so the case for High Speed Serial Link design, where the tool chain effectively has not changed in the past 15 years. SPICE still rules.
I have not seen this coming. For once I agree with something that Bill Gates believes, thinks and states. BBC has Bill Gates on record with the statement that he insists that AI is a threat. While I think that Artificial Intelligence can help humankind quite tremendously, there are dangers inherent in AI. Every and any intelligent species (and at some point in time we have to classify computers running AI as a species) first looks out for itself. In other words, if it is truly intelligent, it will protect itself, and it will procreate. If humans happen to be in the way of AI protecting itself and procreating (i.e. making more advanced successors to themselves), then AI will simply see humans as a threat and do away with as many of them as necessary. Clearly, the genie is out of the bottle, and we'll never be able to put it back in, but we can try to find out how to limit the amount of damage AI can do. If you think that this is overblown, think again. Once our infrastructure is under control of a self-regulating and learning machine, what stops it from prioritizing its own survival over that of humans? And if it is truly intelligent, it will prevent us from switching it off.
I have been asked again and again why Hadoop cannot solve what we have set out to do. Well, there are two big ticket items that are in the way. I guess I'll have to come up with a simple explanation for why there is a disconnect. In very few words Hadoop only works on data where there are no dependencies amongst the data sets internally, and it does not work with floating-point since it was never intended to do that.
Database software has finally escaped the confines of its roots. NoSQL, NoSchema and in-memory compute demand other processor architectures than x86-64 and ARM and different types of memories, and we see requests from database companies for our products. It seems as if "Big Data" has many customers and large-scale users rethink what they do, and why. It is encouraging to see that they look for alternatives, which indicates to me that whatever they are doing now is not working well.
I would like to come back to an observation that stated that for a processor to achieve 1 FLOPS of floating-point processing performance, it would need to provide 1 byte/s of memory bandwidth. While Gene Amdahl disclaimed this particular observation, I think it is important to point out that there is a logical relationship between the performance of a processor and its I/O bandwidth - this would be Kloth's Second Observation if unclaimed or disclaimed by Amdahl or collaborators. One FLOPS is defined as 1 floating-point operation per second. Today, we assume that all floating-point operations are 32-bit floats (minimum requirement) or 64-bit floats (more mainstream today). Most floating-point operations require two operands and produce one result, and the instruction is issued by one instruction word of 16-bit, 32-bit or 64-bit length. Most superscalar processors will need 64-bit instruction words to issue all instructions that are supposed to be executed in the following cycle. In our example, we will see two 64-bit operands being transferred in, one instruction word (64 bit) being transferred in, and one 64-bit result being transferred out. If any instruction is supposed to take one cycle (in superscalar architectures, more than one instruction per cycle is normal) then any FPU will create one result per instruction and thus per cycle. In this example, we will need three 64-bit words in and one 64-bit word out per cycle. That is four quad words per cycle, or 32 bytes of I/O per cycle. That's 32 bytes of I/O per FLOP, or if we divide all units by the time unit 1 s, then it is 32 bytes/s of I/O per FLOPS. This situation is aggravated in any processor that contains multiple superscalar cores and provides a very low number of DRAM interfaces. Only in situations in which compound and very complex floating-point instructions are executed would this not apply. I can think of sincos and tan, the hyperbolic functions, and the log and exp functions.
What implications does that have? You won't see 3 TFLOPS of sustained floating-point performance in a Xeon Phi with a mere 100 GB/s of memory bandwidth. Our numbers are not theoretical peak performance numbers - they are numbers that indicate sustainable performance.
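The 32 bytes per FLOP figure and its implication can be worked through in a short Python sketch. It follows the simple model from the derivation above (two operands and one instruction word in, one result out, all 64 bits wide) and deliberately ignores caches and register reuse, just as that derivation does. The function names are made up for illustration; the 3 TFLOPS and 100 GB/s values are the ones quoted in the text.

```python
def bytes_per_flop(operands_in=2, instr_words=1, results_out=1, word_bits=64):
    """I/O bytes moved per floating-point operation in the simple model."""
    return (operands_in + instr_words + results_out) * word_bits // 8

def sustainable_flops(bandwidth_bytes_per_s, bpf):
    """Upper bound on FLOPS if every operation must be fed from memory."""
    return bandwidth_bytes_per_s / bpf

def required_bandwidth(flops, bpf):
    """Memory bandwidth needed to sustain a given FLOPS rate."""
    return flops * bpf

bpf = bytes_per_flop()                       # 3 words in + 1 word out
bound = sustainable_flops(100e9, bpf)        # 100 GB/s as quoted above
needed = required_bandwidth(3e12, bpf)       # 3 TFLOPS as quoted above

print(f"bytes per FLOP:            {bpf}")
print(f"bound at 100 GB/s:         {bound / 1e9:.3f} GFLOPS")
print(f"bandwidth for 3 TFLOPS:    {needed / 1e12:.0f} TB/s")
```

In this model the gap between the peak number and the memory-fed number is roughly three orders of magnitude. Real workloads reuse data in registers and caches, so the actual gap is smaller, but the direction of the argument is the same.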
It looks like funding for fabless semiconductor startups is still fairly elusive. Maybe we should change company direction to start a ride-sharing service that uses no infrastructure whatsoever but requires people to illegally transport riders from A to B, uninsured of course, and blow through $300M in promoting this useless and illegal service to people who may or may not want it. I just hope that the Directors' & Officers insurance will cover us when the first accident occurs, a passenger gets hurt, or worst case, gets paralyzed, and then sues everyone involved. Wait, such a company already exists, and it has received Venture Capital funding... I must be mistaken in the thorough vetting that the VCs put companies through then. So off to something else. Maybe a gaming company that markets a game with farm animals. Seems like a limited customer set... Oh wait, that already exists as well.
I had yet another lengthy discussion with a potential customer. The discussion revealed that most users - even well-educated HPC users - appear to believe that GPGPUs can solve most computational problems. That's not quite the case since GPGPUs as of today are (and into the foreseeable future will be) SIMD renderers. As such, programmers have to adjust problems to the SIMD nature of the computers, instead of just writing software describing a mathematical solution to a problem. In other words, algorithms must be made to fit the tool, instead of the computer, together with its libraries and compilers, being flexible enough to adjust to whatever algorithm solves the problem best. That's pretty much akin to the situation in which a handyman only brings a hammer to the job site. I was at an event yesterday, and a general contractor said it best: You have to have the right tools for the job. If you need to drive in a nail, a screwdriver won't do. You will need a hammer. If you need to screw in a screw, a hammer won't help. You'll need a screwdriver. What a profound insight. VCs, though, think differently. To them, all tools are hammers, unless of course they are screwdrivers. The only good thing coming out of it is that not only is Sand Hill Road losing traction, but the role and importance of VCs in general are diminishing and will erode further. More startups requiring less money - albeit reducing the barrier to entry for competitors - and more incubators as well as crowdfunding will hopefully lead to a situation in the future in which VCs are no longer needed.
A recent discussion (HPC Analyst Crossfire from ISC'14) between industry giants (Michael Resch, HLRS; Yoon Ho, Rolls-Royce; Jean-Marc Denis, Bull; Brian Connors, IBM; moderated by Addison Snell from Intersect360 Research) about the current state of High Performance Computing, its limitations and the future confirms what we have been saying for a while. HPC and Big Data need a new approach as the old one with General-Purpose CPUs and traditional SIMD GPGPU accelerators simply does not scale out to solve the ExaFLOPS challenge. The discussion also points out that all cloud infrastructure is incremental.
Looks like Uber made a classic Marketing mistake. Their mantra: If you receive negative news then dig up dirt on those journalists instead of figuring out why journalists (and possibly others, including potential customers) have a negative view. That is bad enough if it comes from an employee, but if such suggestions come from a Marketing Executive, then it is time to say goodbye to that person. It is hard to conceive that a company with funding as deep as Uber cannot afford to have ethical and excellent Marketing. Maybe Uber (really German for "Über") is going unter (German for under) already.
eBay has been posting their internal data on compute efficiency for a while now. They display how many transactions they execute per Watt used, and they update that in real-time. I find that very impressive. Unfortunately, eBay's core business is not doing so well (it seems stagnant), and they have decided that PayPal is going to be spun off. Both will be big businesses, and both require a large hardware base to conduct their business. I am curious to see how they are going to go forward with their data centers, and if PayPal continues with the analytics and computational efficiency the same way that eBay did. In either case I believe both will be large-scale customers of servers with accelerators for their data centers. I cannot imagine that eBay or PayPal will run anything on Amazon's AWS, particularly since Amazon has announced its own payment system.
I have not blogged a lot lately, so here we go. AMD's Rory Read has been replaced. While he managed to stabilize AMD financially, it seems that his grasp of what technology was required did not go deep enough. Unfortunately, AMD has been reactive to Intel's moves for too long now, and it is not really considered a competitor to Intel any more. That is really sad because a lot of good ideas and technology came out of AMD. Even recent acquisitions such as SeaMicro could have brought in the needed expertise, but instead it looks like the SeaMicro team has left, and AMD is back to square one. Let's see how that pans out, but the basic premise that most servers just don't need lots of computational horsepower seems right to me. If more floating-point performance or Big Data analytics is needed, just add coprocessors and other accelerators.
It looks like we received the official stamp of approval from Steve "The Woz" for our Flash-based and Hybrid Memory Cube-attached Very Large Capacity Memory. We tend to think that we have a good product with it and that it will enable a different way to address memory and file systems, in a fashion that allows for dynamic reallocation of memory between Random Access Memory and memory used for file systems for those applications that still need them. Here is Steve's feedback: "I really see well implemented flash/processor structures opening the door to software architecture changes." Clearly, Steve agrees with us that it will have a hardware and a software impact and that it will revolutionize how we see memory and file systems.
I have received a few emails lately that surprised me. There is also quite a bit of flak I received for making statements with regard to Venture Capital. I appreciate the questions and feedback, but to some degree I wonder if all the misinformation and PR that VCs have disseminated starts to have the effect of recasting them as successful with a stellar track record in the public eye. As far as I can tell this is just hype that the unsuspecting public believes, lacking better data. The National Venture Capital Association (NVCA) - not exactly in danger of being accused of being on the left side of the political spectrum - reminded association members that they will need to improve their return on investment. Surprised? Over the past 15 years VCs have returned barely the paid-in capital (PIC). The fees for managing funds are 3% per annum on the fund. At an average fund life of 7 years VCs collect 21% of the PIC. For them to return the PIC, they just need to make 21% on the fund over the life. Assuming that they can create a 20:1 return on one big hit, all they need to do is not to sink money on most of the rest. Clearly, they have not. Do you still believe that VCs are smarter than the rest of all investors?
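To put numbers on the fee arithmetic above, here is a minimal sketch in Python. It assumes simple, non-compounded fees charged on the paid-in capital, and it assumes the fees come out of the fund so that only the remainder gets invested. The function names are illustrative and the 3%, 7 years and 20:1 figures are the ones quoted above.

```python
def fee_drag(annual_fee=0.03, fund_life_years=7):
    """Total management fees as a fraction of paid-in capital (simple, not compounded)."""
    return annual_fee * fund_life_years


def breakeven_multiple(annual_fee=0.03, fund_life_years=7):
    """Multiple the invested remainder must reach just to hand back the paid-in capital.

    Assumption: fees are paid out of the fund, so only (1 - fees) is ever invested.
    """
    invested = 1.0 - fee_drag(annual_fee, fund_life_years)
    return 1.0 / invested


def hit_share_needed(hit_multiple=20.0):
    """Fraction of paid-in capital that must sit in one big hit to return
    the whole paid-in capital if everything else returns nothing."""
    return 1.0 / hit_multiple


if __name__ == "__main__":
    print(f"fees over fund life: {fee_drag():.0%} of paid-in capital")
    print(f"break-even multiple on invested money: {breakeven_multiple():.2f}x")
    print(f"share needed in a 20:1 hit: {hit_share_needed():.0%}")
```

Under these assumptions the bar is a little higher than a flat 21% gain, because the gain has to be earned on the 79% that is actually invested. Even so, a single 20:1 hit holding 5% of the paid-in capital would clear it.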
What can we derive from this? Do not put any money into managed mutual funds because all those people collecting fees do is funnel the money to VCs, without checking if there is a return greater than an unmanaged S&P500 ("index") fund, or the national inflation rate. 21% of your money goes to people who have more than enough of it already, and who have proven over the past 15 years to not return more than the paid-in capital. The same is true for life insurance and any other money market instrument that relies on VCs. After all, it is your hard-earned money, and you will need to rely on it and its return once you retire. In other words, this is not a laughing matter. It is crucial to your financial well-being. Don't rely on hype, on hearsay, or on statements that cannot be substantiated. Look at hard numbers. While the S&P500 may not have been perfect, its return was better than that of nearly all managed funds. On top of that, VCs outsource jobs to people who largely don't pay into US retirement funds. Do you feel comfortable funding people who compete with you?
Once SSRLabs has made it past seed and an A Round that I may have somehow put together, hell will need to freeze over for me to get VCs involved in a B Round. If VCs don't want to take risk in seed or in A, then I can go to any bank to underwrite a B Round - there is no need whatsoever to involve a VC in a B Round. In other words, if they refuse to fund a seed or an A Round, I do not need them for a B Round - and if all startups think that way, VCs will deservedly end up as roadkill.
One more time we were (and still are) ahead of IBM, Intel, AMD and nVidia. We have been developing and designing a Neural Network Coprocessor that is not a Turing Machine, not a Finite State Machine, not a von-Neumann architecture, and certainly not a Harvard architecture. So the article that an "IBM Chip Processes Data Similar to the Way Your Brain Does" is not new, and it sure is not surprising to us. We have been saying for a while now (actually since 2004, discussed at the Asilomar Microcomputer Workshop) that for all parallel processors von Neumann and Harvard architectures need to go out the window, and while we are at it, Turing Machines (Finite State Machines) should go too.
Doing the same thing over and over again and expecting a different outcome is the clinical definition of insanity. That's all I can think of when I see some of the grant RFQs these days. The basic premise of the DoE and DoD and others is that whatever we do today does not work. In Big Data, in HPC, in a lot of other areas the solutions don't scale out linearly, they consume too much power, and they are difficult to program and to manage. It just does not work. So, to overcome this, let's formulate a goal: Achieve at least 1 ExaFLOPS at 20 MW or less. Submitted and publicly available are documents and presentations from IBM, Intel, AMD, ARM and nVidia. All of these formidable companies restate the obvious. No matter what, they can't get to 1 ExaFLOPS at below 20 MW in any process or any generation of their processors. In other words, all of the established companies say that something new needs to be done. All participating startups say the same: established architectures just don't cut the mustard, and something fundamentally new must be done. Then the awards come out. The money goes to POWER, x86-64, nVidia/CUDA, Xilinx and Altera. In other words, to those who said they could not do it. We have seen this over and over again. Insanity? Do we really want to change things? India, China and Russia have committed to new processors. The US? Not so much. Is that where we want to be?
SSRLabs has been asked many times by computational microbiologists (among others, which by now is a long list...) when we'll have our coprocessors ready for sale so that these researchers can deploy them and vastly accelerate the process of finding individualized cures to cancer, Alzheimer's disease and the Autism Spectrum Disorder. While we'd like to deploy them and supply those coprocessors to these fantastic researchers who try to improve the world, we need money to do so. VCs have been unwilling to fund us, with the reasoning that our ROI would be too low or it would take too much money. Yet the same people pump over $300M into Uber, where there is no anticipated or visible ROI other than that Google may buy them. I can only hope that those same VCs will understand that many diseases, including cancer, are curable with our technology being the basis for individualized medications developed by the computational microbiologists who have repeatedly asked us. I know that this will happen. Individualized medicine is the next big thing simply because the average life expectancy keeps going up. With an increased life expectancy comes a larger bill towards the end if we do not find individual cures.
I am flabbergasted by some of the developments I have seen in corporate valuations. A few examples: Uber makes nothing, has no infrastructure, and the barrier to entry is about zero for any competitor. Yet some people think it may be worth $200B at some point in time. Not that I like GM, but they actually have fabs and employ people, and they are valued at $55B (market cap as of today) - and I think that is overvalued. In what universe would Uber be worth about four times as much as GM? What mental midgets are these people to make such claims? I am told by VCs that the value of a company is 4 to 7 times the trailing revenue. That seems somewhat reasonable, but why does this thinking goes out the window when something intangible is looked at? Uber is a business plan, and a bad one at that. So are a few others. They try to disrupt a low-margin, highly regulated market. You'd have to be borderline institutionalized to go with the plan and believe they can be worth anything. Granted, I don't always trust taxi cab drivers, and most of their driving skills appear to be taken directly out of a movie in La-La-Land, but at least their vehicles are inspected (albeit superficially) and they carry insurance. Uber? I have to rely on the driver's insurance. Same for AirBnB. If I am on vacation why would anyone think it is a good idea to rent out my house? Whoever is there has full control of my house, my belongings, my possessions, everything. More importantly, if the tenant is there for more than 30 days, in California certain rights kick in, and I can't get rid of the tenant. So I'd have to rent an apartment. That has got to be the most mentally retarded business plan ever, and it relies on finding people who have no regards or understanding of any law. Nevertheless, investors appear to think that this flawed business plan has value. The only thing I can think of is a pump-and-dump scheme, and last time I checked, they were deemed illegal. 
So VCs push illegal pump-and-dump schemes? If I remember correctly, the last time global societies had an idea of sharing possessions was when Karl Marx and Friedrich Engels wrote "Das Kapital". Marxism and communism failed, predictably, because humans just don't work that way. Why do VCs (Venture Capitalists) now assume that it will work this time around? And why do VCs push the opposite of what their name stands for? Do they think we (investors, employees, contributors to life insurances, mutual funds, founders, inventors) all are crazy and stupid? Whoever is backing them and advising them must not have read or understood history. As we all know, failing to understand history makes us vulnerable to repeating the same mistakes that others have already made.
Computer Science is somewhat stuck with the 50-year old paradigm of the Turing Machine, or Finite State Machine. That paradigm works fine for a small ASIC or a processor with a very limited instruction set. It does not work well for large processors - RISC or CISC - and most importantly, it does not work for parallel processors. For massively parallel processors, it completely fails us. Why? Just do the math on an array of n processors with m states each, all with data and status dependencies. A relatively small number of cores with a relatively small number of statuses each will lead to a status explosion if there are dependencies and any core can be connected to any other. That's in essence what neurons and synapses are - and we know that they are not deterministic and they certainly do not qualify as Finite State Machines, neither as an individual neuron, nor as a collection of them.
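To make "just do the math" concrete, here is a minimal Python sketch. It assumes n cores with m states each, where any core can depend on any other, so the combined machine can be in any of m^n joint states. The function names are illustrative.

```python
def joint_states(states_per_core, cores):
    """Size of the combined state space if every core's state can matter: m ** n."""
    return states_per_core ** cores


def all_to_all_links(cores):
    """Number of distinct core-to-core connections if any core can talk to any other."""
    return cores * (cores - 1) // 2


if __name__ == "__main__":
    # Even tiny per-core state counts blow up quickly as cores are added.
    for cores in (2, 4, 8, 16, 32):
        print(f"{cores:>2} cores x 8 states: "
              f"{joint_states(8, cores):.3e} joint states, "
              f"{all_to_all_links(cores)} links")
```

The per-core state count stays small and fixed in this sketch. Only the core count grows, yet the joint state space grows exponentially, while the number of links grows quadratically.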
SanDisk acquired Fusion-IO for about $1.1B. I believe that SanDisk had a lot of great products, and the same is true for Fusion-IO. Both corporate cultures are by themselves appropriate for the target market, the required speed of innovation, and the necessary margin as well as the number of layers of hierarchy. Combining them could result in infighting, and the combined entity is at risk to fail. The same way that AMD still is the "green AMD" for processors and the "red AMD" for ex ATI (minus the VR GPUs that are now in nearly all smartphones) SanDisk will continue to be made up of those two entities. The SanDisk part might continue to push for advances in SATA-attached SSDs, whereas the Fusion-IO part probably will fight for PCIe-attached solid state memory, de-emphasizing the "disk" part of SSDs. With the internal rivalry it is hard to imagine that this new entity can efficiently focus on advances beyond PCIe-attached solid state memory.
VCs seem to see software as the solution to everything these days. I am just waiting for the day that a General Partner of a VC company asks for a firmware update and a software patch when he encounters a flat tire instead of calling AAA (or any equivalent). I am starting to wonder if many people - some programmers included - still know that all of the Java code they write is ultimately executed by a physical processor. Considering the pervasive use of High Level Languages (HLLs) on virtualized machines executing code that may not be native to the processor and that gets translated in near-real-time to a native instruction set I may be naive to assume that this understanding is still required or useful. However, that leaves them vulnerable to scams such as the software that claimed to double the memory capacity a decade ago.
It looks like Tilera was snapped up for a song. Deservedly and rightfully so, I might say. I hope that at least the employees got a decent deal out of the acquisition, whereas I can only wish for the VCs to have gotten burnt. Tilera is one more example of VCs choosing the wrong company. The article "EZchip to Acquire Tilera, a Leader in High-Performance Multi-Core Processors" describes the terms and circumstances of the acquisition. There is a noted lack of application and deployment areas of the Tilera architecture, and that is no wonder. It has so many architectural flaws that it really is difficult to see where it can be used effectively and efficiently. Tilera is yet another example of funding going to a company that claimed to do real-time image and video analysis despite its architecture being incapable of doing so. Both Tilera and Stream Processors Inc targeted that market, and both failed at fulfilling the promise. Movidius and MobilEye excelled in that field. For Tilera, every subsequently targeted market failed them as well. I am glad that there is one less competitor since it is obvious that EZchip will retarget the entire product line towards Deep Packet Inspection (DPI) - the only area where it will work well enough.
I found an interesting article today. The article "China's Supercomputing Strategy Called Out" is effectively saying that Tianhe-2 is a nice piece of hardware but largely not usable because no software exists to make use of it. That's of course BS. The same challenge applies to any other supercomputer as well. All of them require software to be tailored to that particular machine, which in turn highlights a problem that no one likes to talk about. In supercomputers any computational problem must be tailored to the specific machine. The differentiation between hardware, firmware, operating system and application software is not as strict as one would expect. Ideally, a programmer would write his application program once (be it protein folding, simulations for the cure of cancer or Alzheimer's) and just use Application Programming Interfaces (APIs) for the computationally hard parts that we provide. These parts would run on our coprocessors and schedule and distribute the load automatically, meaning that the exact same application software runs on any accelerated computer, whether it is equipped with 1 of our coprocessors or with 10000 of them, or even more. It does not matter - we will take care of the housekeeping. This is akin to cars: Who these days knows how an internal combustion engine works? And largely there is no need to, they just function and we just use them. Astonishingly, supercomputers don't work that way yet.
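The "write once, run on 1 or 10000 coprocessors" idea can be sketched in a few lines of Python. This is only an illustration, not an actual SSRLabs API: `accelerated_map` is a hypothetical name, and ordinary worker threads stand in for coprocessors. The point is that the application code never mentions how many devices exist.

```python
from concurrent.futures import ThreadPoolExecutor


def accelerated_map(kernel, work_items, num_devices):
    """Hypothetical API call: spread work_items over num_devices workers.

    Worker threads stand in for coprocessors here. Scheduling and load
    distribution are hidden from the caller, and results keep their input order.
    """
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        return list(pool.map(kernel, work_items))


def square(x):
    """Placeholder for a computationally hard kernel."""
    return x * x


def application(num_devices):
    """The application itself: identical no matter how many devices are installed."""
    return sum(accelerated_map(square, range(1000), num_devices))


if __name__ == "__main__":
    # Same application, same answer, different amounts of hardware.
    print(application(1) == application(8))
```

Only the `num_devices` argument changes between a small and a large machine. In a real system the runtime would presumably discover that number itself.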
We have been asked about BitCoin Mining a few times. Quite a bit of money is being invested in BitCoin miners, which essentially are massively parallel engines computing hashes. These engines can be used for BitCoin mining, but they have a second use. They can easily be misused to compute hashes for passwords. These engines can generate billions of hashes per second. In those scenarios, all passwords of a specific length are computed and these hashes are stored in a database. The advantage for the hackers is that the size of the database with the hashes is substantially smaller than a database with all possible passwords of any given length would be. Oftentimes a database containing personal data such as names, addresses, credit card information and more is encrypted with an administrator's or a system password. All that is needed is a match between the system or the administrator password and the pre-computed hash for a large-scale breach to occur. Having hashes available basically means that only those hashes stored in a database need to be used to decrypt the user database, and then the same methodology can be used to decrypt the individual user credentials. As a result, even those databases that are commonly assumed to be safe are vulnerable. BitCoin miners make these extremely vulnerable. A challenge protocol would be the only way out of this dilemma. Until then, all user databases must be assumed to be vulnerable to attacks with BitCoin miners and pre-computed hash databases.
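A back-of-the-envelope sketch in Python shows why "billions of hashes per second" matters. The 62-character alphabet, the password length of 8 and the rate of one billion hashes per second are illustrative assumptions, not measurements of any particular miner. The sketch also ignores salting and deliberately slow hash functions.

```python
def keyspace(alphabet_size, length):
    """Number of distinct passwords of exactly `length` characters."""
    return alphabet_size ** length


def seconds_to_enumerate(alphabet_size, length, hashes_per_second):
    """Time to hash every password of that length once at the given rate."""
    return keyspace(alphabet_size, length) / hashes_per_second


if __name__ == "__main__":
    alphabet = 62   # assumption: a-z, A-Z, 0-9
    rate = 1e9      # assumption: one billion hashes per second
    for length in (6, 8, 10):
        days = seconds_to_enumerate(alphabet, length, rate) / 86400
        print(f"length {length}: {keyspace(alphabet, length):.3e} passwords, "
              f"{days:.2f} days at {rate:.0e} hashes/s")
```

With these assumed numbers, all 8-character passwords take roughly two and a half days on a single engine. Each additional character multiplies the effort by the alphabet size.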
SSRLabs' TAB has discussed current memory technologies particularly in light of current multi-core processors and many-core processors. For massively parallel processors and any of the above executing multi-threaded applications current memory simply does not have the performance (bandwidth and latency) that those processors demand. In other words, scaling performance of processors using memory will not work since memory is the limiting factor. While that is not new the magnitude of the discrepancy is astonishing. As a result, we have made sure that our Very Large Capacity Memory supports a very large number of multi-threaded cores, even in shared memory applications that are the underlying hardware for large in-memory compute deployments such as SAP's HANA.
It looks like Amazon has quietly hired the Calxeda team in Texas, basically getting Amazon into the chip and processor design business. It makes sense since now Amazon has both ends of the communication (and value creation) chain: cell phones, tablets/readers and servers in massive data centers, for AWS among others. If Amazon wants to compete with Apple and Samsung it cannot simply keep buying Qualcomm processors for the cell phone (and of course the Kindle), and neither can it continue using Intel Xeon processors in its data centers. Designing its own processors makes sense for Amazon, although that move still really surprised me. I would have thought that AMD would buy Calxeda since it would have perfectly complemented its SeaMicro acquisition. On top of that both AMD and GlobalFoundries are owned by ATIC, and GlobalFoundries was a big investor in Calxeda. Well, I was wrong. Amazon saw more value in Calxeda than AMD did.
I don't know - maybe I am getting old, more demanding, or the desire to build robust systems is going the way of the Dodo. Here is what happened today. We have a whole bunch of servers. They are kept alive in case of a power outage by industrial-strength Uninterruptible Power Supplies or UPSes. The less crucial systems were on a consumer-type UPS. I of course did not expect the consumer UPS to be as robust or reliable as the industrial-strength UPSes, but I had no idea that consumer UPSes actually decrease uptime compared to just the unprotected power grid. In every UPS there is a battery - mostly a lead-acid battery. Their lifetime is about 3 years, plus or minus a few weeks. That is not great, and it is not economic if something irrelevant is protected by it, but it is acceptable for systems that are not crucial but need to be available even if utility power is out. It seems that one of the consumer UPSes had decided that its residual battery capacity was below a preset threshold. That is acceptable. The fact that it started beeping in the middle of the night instead of sending me an email is not. When I am not in the office, I cannot hear the beeps. More importantly, it decided that after some time of beeping without an immediate battery change it would be appropriate to shut down all output power. That is not acceptable at all. Think about it: You have Grandma's ventilator on this UPS because you want to make sure that a 15-minute utility power outage won't kill her. Three years after having put the UPS in, the battery is below a threshold, and it starts to beep. You can't get a replacement battery immediately, but you think that for as long as the unit is powered by utility power, Grandma's ventilator is going to be protected. It just will be an issue if utility power goes out. So you run off to your local Fry's to get a new battery. Well, Grandma was not protected after all.
After a certain period of time less than 4 hours, the unit shut down output power although it did not need to run on batteries. No output power = ventilator not working. Whoever designed this UPS needs to rethink that design. No matter what, that is the wrong failure mode. A UPS must continue to forward power to the consumer even if there is no battery, the battery is dead, or the battery is below a threshold for as long as there is input power. A failure mode that shuts off output power even if there is input power during a battery failure is unacceptable. Unfortunately, this kind of thinking is becoming pervasive. It is terrible to see this in software. We have all seen the "Blue Screen of Death" and come to accept it. There is no reason to accept it, not even in software. It is deeply disconcerting to see this appearing in hardware now too. What we need is better robustness and deterministic behavior. The more functions we outsource to computers, the more robust they have to be. Currently, it is the opposite: everything is probabilistic.
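The failure mode argued for above fits in a few lines. The following Python sketch is an illustration of the rule, not the firmware of any real UPS. The type and function names are made up, and "battery usable" is assumed to be false when the battery is missing, dead or below its threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpsState:
    input_power: bool     # utility power present at the input
    battery_usable: bool  # False if missing, dead, or below threshold (assumption)


def output_enabled(state):
    """Output stays on whenever there is input power, regardless of the battery.

    Without input power, the output stays on only while the battery can carry the load.
    """
    return state.input_power or state.battery_usable


def needs_alert(state):
    """A bad battery is a reason to notify the owner, never a reason to cut power."""
    return not state.battery_usable


if __name__ == "__main__":
    for power in (True, False):
        for battery in (True, False):
            s = UpsState(input_power=power, battery_usable=battery)
            print(s, "output:", output_enabled(s), "alert:", needs_alert(s))
```

The only state in which the output goes dark is the one where there is genuinely nothing left to forward: no input power and no usable battery.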
At SSRLabs our focus has always been on designing processors that allow large amounts of memory to be addressed. We knew that large memory arrays with low latency and a large number of open pages would be required. So we developed technology that would allow memory arrays to fully support "Big Data" and all types of in-memory compute applications, including Cassandra, Hadoop, SQL and NoSQL databases. Now we have received confirmation that our technology is not merely a solution chasing a hypothetical problem, but is in fact projected to address a huge market. The market research company MarketsAndMarkets has projected that the In-Memory Computing Market will be worth $13.23 Billion by 2018. Processor designs at SSRLabs have produced incremental IP that makes use of the Hybrid Memory Cube (HMC) infrastructure and extends that to storage even denser than DRAM, specifically Flash memory. With our intellectual property and silicon products we can address the issue of extremely large memory arrays while at the same time keeping power consumption and thus excess heat production at bay.
It looks like we have been right again. TechCrunch published an article about a new 3D vision chip. See "Inside The Revolutionary 3D Vision Chip At The Heart Of Google's Project Tango Phone" for more information, but that is actually a derivative of what I have architected and patented with my prior company, Parimics, way back in 2002. The patents are described in detail in the US Patent Office's online repository. First of all, it's a coprocessor so it does not have to run all the tedious things that are of really no interest to compute. Second, it uses effectively two layers of distinctively different types of compute - a neural net and an SMP-based engine to derive motion from the objects. However, we did this 12 years ago, and we have progressed way past that. We also had to use custom cells since the existing libraries did not provide enough performance at any acceptable level of power consumption. Our current architecture is substantially more capable. We'll see how this is recognized.
I just found these two articles ("Rambus Founder Opines on Semiconductor Industry's Future" and "Is Micron Gunning for Intel's Processor Business?") by Jim Handy in Forbes. They are in a lot of ways real gems. First, Jim reports on Mark Horowitz talking about energy-efficient compute at ISSCC 2014 in San Francisco. It's great to hear that the Rambus founder arrives at the same conclusion that we have. Second, Jim really understood not only the problem and the challenge the semiconductor industry faces, but he got the implications right. We need different processors, different memory, and most importantly, wherever we can, we must identify code snippets that are used over and over again, and provide hardware in the form of an ASIC or a fixed-function block in a processor that executes the same algorithm at higher speeds and lower energy consumption than software in a general-purpose processor. I am glad to see this in a mainstream publication. The second article highlights Micron's Automata processor. Again, while the processor is not a general-purpose processor and does not quite do the same as SSRLabs' coprocessors, I am glad that the notion of "every computational problem we have can be solved in software on an ARM or x86-64 processor" is debunked even in a non-technical mainstream news magazine.
Google has announced that they are going to sell Motorola Mobility to Lenovo. A lot of speculation surrounds this divestiture. The only people who really know are the Google Executive Management team and their Board of Directors. Everyone else can only speculate, and that is dangerous because they don't have the full picture. Google's Executive Management team and their Board of Directors have so far shown that they are led by data, and they have this data. Everyone else likely has incomplete or inaccurate data on why Google seemingly did not need Motorola Mobility any more. What do we know for a fact? Motorola Mobility was hemorrhaging money, and it created a competitor to Google's partners and the ecosystem around Android. Google also retained the vast majority of patents, allowing it to defend Android.
HPC and "Big Data" both have the potential to change compute. The requirements for throughput may just be too big for traditional servers. QPI simply does not allow for good-enough scaling. SSTL-2 and DDR-3/4 are just not fast enough to support the requirements of "Big Data" and HPC. Even memory sizes of today's processors are insufficient for large computational challenges. As a result, current processors rely on mass storage, which is remote memory with a file system. Even in the most modern systems today this mass storage is attached through PCIe instead of being direct access memory to a cluster of CPUs. The latency CPU -> QPI -> PCIe -> PCIe Flash is huge. While it is better than CPU -> QPI -> PCIe -> SATA Host Adapter -> SATA Flash or, worst case, hard disks, it is vastly slower than direct-attached memory. SSRLabs' coprocessors support direct-attached memory, and they can work with storage appliances that provide storage with file systems where required.
I had an interesting discussion with operators of data centers for HPC and "Big Data". Both require fast access to large amounts of data, high bandwidths between I/O and processors, and very large compute capabilities. HPC additionally needs very high floating-point performance, whereas "Big Data" oftentimes gets along with integer processing, but it needs to return results at lower latencies. Scaling out performance of either system must take into account balance of I/O, processing performance and inter-processor communication required for the computational tasks. The latter is often overlooked or ignored, but it may have the biggest impact on system performance.
It looks like we are seeing a renaissance and resurgence of hardware. After all, hardware poses a barrier to entry - software does not. Anyone can write code in Java or Scala or any other language. However, if your software runs on specific hardware that's not available to others, you have just created a huge barrier to entry for anyone else. Google has just announced that they have agreed to buy Nest Labs for $3.2 billion. Google's software alone (Android@Home) did not make it; instead, hardware was required. That was worth $3.2B obviously, at least to Google. For more on this CNN Money has an article here: "CNN Money reports that Google will acquire Nest Labs".
It looks like 2014 may see the emergence of ARM-based multi-core server CPUs. That makes a lot of sense in applications that don't require lots of computational performance, such as those for file and stream services or proxies. It will cut cost, power consumption and to some degree complexity. Despite Calxeda's unfortunate demise I think that SeaMicro and amcc are well-positioned. That is good for us too, because it means that servers are going to be more task-specific, and that computational servers are going to be a new breed, somewhat along the lines of the engineering workstations that were required for quite a while to solve engineering and scientific problems.
First of all, Happy New Year! Second, with all the new security breaches, this time including Target and SnapChat, I wonder why no one does the math. Let's say it costs $1M to penetrate three layers of security (firewall, internal network, root or admin's server account). If that is a given, then breaking into a small company will not yield $1M in profits for the intruders. So no one will spend $1M to illegally make $100K. But if for the same $1M these intruders can get into Target, SnapChat or others, and they may have access to hundreds of thousands of customer records, they will be able to sell those assets even on the black market and illegally for much more than $1M. Would not that make the case against any big hoarder of data? Would not that necessitate purging of credit card and customer data 30 days after its use?
I have run a few simulations that in fact show that there is a direct correlation between DRAM and I/O bandwidths and performance in GFLOPS in a processor. As expected, caches only mask the problem of having too few registers for compute and communication. The situation is a bit more relaxed if higher-level APIs such as OpenCL and OpenACC are used as the granularity of the task is coarser, and compilers can optimize compute such that less data movement is necessary. Therefore, it is mandatory that processors have enough internal registers, enough bandwidth to memory, and that the interconnects between the processors are low-latency and extremely high bandwidth. PCIe Gen3 interfaces in 16-lane versions provide 15.75 GB/s of bandwidth at fairly high levels of latency, and even Intel's QPI only goes up to 25.6 GB/s. Four QPI ports therefore max out at about 100 GB/s - and that is a peak data rate. That's by far not enough.
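For those who want to check the interconnect figures, here is a short Python sketch of where they come from. It assumes the standard PCIe Gen3 signalling rate of 8 GT/s per lane with 128b/130b encoding. For QPI it assumes 6.4 GT/s with 2 bytes of payload per transfer, counted in both directions. The function names are illustrative, and all results are peak rates, not sustained throughput.

```python
def pcie_gen3_gbytes_per_s(lanes=16):
    """Peak one-direction PCIe Gen3 bandwidth in GB/s."""
    gigatransfers_per_s = 8.0        # per lane
    encoding_efficiency = 128 / 130  # 128b/130b line code
    bits_per_byte = 8
    return gigatransfers_per_s * encoding_efficiency * lanes / bits_per_byte


def qpi_gbytes_per_s(gigatransfers_per_s=6.4, payload_bytes=2, directions=2):
    """Peak QPI bandwidth per port in GB/s, both directions combined."""
    return gigatransfers_per_s * payload_bytes * directions


if __name__ == "__main__":
    print(f"PCIe Gen3 x16: {pcie_gen3_gbytes_per_s():.2f} GB/s")
    print(f"QPI, one port: {qpi_gbytes_per_s():.1f} GB/s")
    print(f"QPI, four ports: {4 * qpi_gbytes_per_s():.1f} GB/s")
```

Four QPI ports come out at 102.4 GB/s under these assumptions, which is the "about 100 GB/s" quoted above.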
Merry Christmas everyone! Anyone who knows me knows that I have very little regards for political correctness or for religion - but Christmas ultimately is a pagan tradition, predating all major religions. After all, when is the last time you saw fir trees and snow in the Fertile Crescent? Anyway, enjoy this tradition, which ultimately is a celebration of the winter solstice, and as such, it definitely will have focused on life, on why we are here, and on what the purpose of human life is, and celebrating that we (as the human race) had made it through a few hundred thousand years, in other words, it will have focused on human ingenuity and curiosity. Put yourself in the shoes of people over 2000 years ago: They had figured out that summer and winter solstices occur on about 360 day periods. They noticed that these solstices coincided with the alignment of some stars and some constellations in the sky that they could not quite figure out what they were. Would not you be in awe and celebrate this if you had survived the year having gathered seeds, planted them, watched them grow, harvested them, and put the harvest in winter storage? I sure would. (And of course maybe you or your buddies would have figured how to make beer - which is essentially watered-down fermented honey or grains with hops to preserve it!)
It looks like Calxeda has shut down. eWeek reports that "Calxeda, Pioneer in ARM Chips for Servers, Shuts Down". On the surface, that is bad news because it is yet another semiconductor company that bit the dust. Unlike what their VP Marketing, Karl Freund, states I do not think that this is because they were too early to the market. I think it is because they have delivered an unsellable chip - a 32-bit-only server processor. The market segment they were after is a segment in which very low computational performance is required (that's why the ARM architecture could shine with low power and low performance), but it needed a large address range, especially since it was a multi-core CPU. I assume that the market segment was the appliance market, particularly the caching appliance. On a total $103M investment that is not what the market had expected. If I compare that to SeaMicro, amcc and Marvell Calxeda was behind the curve. This paints a more comprehensive picture: Calxeda had the wrong product, and whatever of the IP proves to be reusable will be acquired. This is a normal process, and particularly in semiconductors consolidation is key. On top of that I think it may have dawned on ATIC - which owns AMD and GlobalFoundries - that they had two very similar and competing companies in their portfolio. AMD had bought SeaMicro a while ago, and GlobalFoundries was invested in Calxeda. That did not make sense. So maybe now SeaMicro will get what is left of Calxeda for cheap - or Google picks up the bits and pieces. Both outcomes would make sense.
There is more proof that Google again is leading the pack. Google started out as a software company, with an idea of how to search. It then morphed into a software company that had to operate its own data centers, and in the beginning seems to have been ill at ease with that. Now the entire organization is very comfortable with hardware. Google then acquired Motorola Mobility, and it became clear very quickly that it was not only about the patents. In the meantime, Google has grown comfortable with developing its own servers, networking hardware, and ultimately even the systems behind the autonomous vehicles. The camera systems for Google StreetView - all developed in-house. Google Glass was developed within the company. Now Google is buying the leading robotics company, Boston Dynamics. Very clearly Andy Rubin and the Executive Management at Google have understood that owning multiple verticals will make Google stronger than owning multiple horizontals. In other words, hardware, ASICs and processors are back.
Within the HPC and Supercomputing Group on LinkedIn there was a very interesting discussion, indicating to me that a lot of challenges in HPC are either not understood at all, or misunderstood based on faulty data fed to the informed public by Marketing Departments, spreading FUD (Fear, Uncertainty and Doubt). The first allegation is that coprocessors aren't desirable in and of themselves. That is an observation that can easily be proven wrong. A general purpose CPU has to be able to deal with any task that comes up in a computer, even a server, be it user interface, GUI, file I/O or anything administrative. In contrast, a coprocessor or accelerator can be designed, built and optimized towards a specific workload, in the case of HPC of course towards solving complex and large floating-point problems. Therefore, a large percentage of the 500 supercomputers on the top500.org list use accelerators of one kind or another, particularly x86-64-based systems. The top500 list shows 52 out of 500 (>10%) and 4 out of the first 10 (40%) using coprocessors or accelerators already. Just comparing the number of cores we see that the top 10 out of the top500 have a combined 8322830 GPCPU cores and combined 3437806 coprocessor or accelerator cores, which amounts to 41% of the combined GPCPU cores. The same exercise for all of the top500 machines gives a total of 20720693 GPCPU cores and 4585300 coprocessor or accelerator cores, or 22% of the GPCPU cores. In other words, coprocessors account for between 22% and 41% of the GPCPU cores in all of the top500/top10 supercomputers on the planet. Xeon Phi, nVidia, AMD/ATI are in use today, and even IBM's Cell Broadband Engine was in use as an accelerator for quite a while. The second objection seems to be that getting data in and out is often a performance bottleneck. That is an interesting and completely false assessment. If one CPU is not fast enough, multiple of them are ganged together. That is done through QPI for Intel processors. 
Peripherals - including Ethernet and InfiniBand Network Interface Cards (NICs) - are connected through PCI Express Gen3 in 16-lane variants. If the server becomes the file I/O frontend and the accelerator has more memory and higher performance, the bottleneck - even if it exists - becomes irrelevant. I will do the math: Linking servers together at 12.5 GB/s (assuming 100 Gbit Ethernet, no encapsulation or protocol overhead, no CPU load, whereas in reality most supercomputers use QDR InfiniBand at 40 Gbit/s) is considered ok, and of course those NICs would have to use PCIe Gen3 with 16 lanes at 15.75 GB/s. However, using that exact same interface to the coprocessor is allegedly not providing enough bandwidth. Alternatively, we look at QPI, the fastest interconnect Intel has to offer as a processor-to-processor interconnect. At 3.2 GHz core clock it gets to 25.6 GB/s (which of course also explains why there are no PCIe Gen3 32 lane add-on cards!). In other words, the PCIe Gen3 16 lane interface gets to over half of the highest interface bandwidth in an Intel Xeon-based system. I wonder how that correlates with statements from Marketing Departments of not being fast enough for data I/O, whereas connecting servers through any realistically much more latency-laden variant of Ethernet or InfiniBand is?
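The percentages and bandwidth figures in the discussion above can be re-derived in a few lines. This is only a back-of-the-envelope sketch, not a measurement: the core counts and the 25.6 GB/s QPI figure are copied from the text as given, the PCIe Gen3 rate assumes 8 GT/s per lane with 128b/130b encoding, and all function and variable names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted above.
# Core counts and the QPI number come from the text; names are illustrative.

def share(accel_cores: int, gpcpu_cores: int) -> float:
    """Accelerator cores expressed as a fraction of general-purpose CPU cores."""
    return accel_cores / gpcpu_cores

def pcie_gen3_gbytes_per_s(lanes: int) -> float:
    """Raw PCIe Gen3 bandwidth per direction in GB/s.

    Assumes 8 GT/s per lane and 128b/130b line encoding; any protocol
    overhead above the physical layer is ignored.
    """
    bits_per_s = 8e9 * lanes * 128 / 130
    return bits_per_s / 8 / 1e9

def ethernet_gbytes_per_s(gbit_per_s: float) -> float:
    """Ethernet line rate converted to GB/s, with no encapsulation overhead."""
    return gbit_per_s / 8

TOP10_SHARE = share(3_437_806, 8_322_830)
TOP500_SHARE = share(4_585_300, 20_720_693)
PCIE_X16 = pcie_gen3_gbytes_per_s(16)
ETH_100G = ethernet_gbytes_per_s(100)
QPI = 25.6  # GB/s, as quoted in the text

if __name__ == "__main__":
    print(f"top10 accelerator share:  {TOP10_SHARE:.0%}")
    print(f"top500 accelerator share: {TOP500_SHARE:.0%}")
    print(f"100 GbE:          {ETH_100G:.2f} GB/s")
    print(f"PCIe Gen3 x16:    {PCIE_X16:.2f} GB/s")
    print(f"PCIe x16 vs. QPI: {PCIE_X16 / QPI:.0%}")
```

Under these assumptions the same PCIe Gen3 x16 link that is considered adequate for a 100 GbE NIC comes out at a bit over 60% of the quoted QPI bandwidth.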
I think by now a lot of people may have heard that Google is presumed to develop their own processors. I would like to explain why I think that would make sense. First, since Google is an architecture licensee of ARM (through Motorola Mobility), they could add functions to the ARM core and thus to the instruction set, and they could add specific coprocessors that offload some of the work done otherwise in software. This would create an even higher barrier to entry for any competitor, even if all of the software remains open source. It therefore shuts out competition and is good business tactics. Second, they could use those processors to reduce the cost of their servers, and potentially cut back on energy use. Again, this is a sound business decision. Third, it would force Intel to lower prices on Xeon to Google for those servers where the computational power of the Xeon still is needed, and it would force Intel to step up competitiveness of their Atom processor, both in price and in performance. This is yet another good piece of business tactics. The development cost can be entirely written off as R&D cost, and since Google's search and nearly all other algorithms are pretty much set, implementing them in a fixed-function piece of combinatorial logic or a coprocessor - possibly in conjunction with one or more hash engines and TCAMs - does not pose a large risk in execution. So if Google in fact did it, they'd reap enormous benefits with very little to no risk. That would be yet another example to prove that software functions migrate to hardware once they are stable, and it would again show that hardware (including ASICs and processors) is the underlying engine that makes things work, both technically and economically.
Yesterday evening I attended a presentation that the great team of The Hive had put together. Two Facebook engineers explained the genesis of RocksDB. In short, they needed an application to retrieve data from a database server. The initial bottleneck - spinning disks - was easily overcome by replacing them with SSDs. That opened up a new problem which was the communication latency between the application server and the database server. Instead of using TCP and UDP offload NICs, they pulled the database into the application server and encountered cramped main memory, forcing them to scale down the capabilities of the database. While latency goals were now met, the resulting database is neither scalable nor capable of High Availability. A substantial amount of custom software design time was spent. It looks like it would have been faster, easier, cheaper and much more straightforward to use SSDs, TCP and UDP offload NICs and coprocessors with a much larger address space. I think that this serves as yet another example to prove that developing software to solve a problem is not always cheaper than developing and deploying new hardware.
Larry Hardesty from MIT pointed out in Scientific Computing one more time how important matrix multiplications are. His article "Matrices Have Broad Ramifications in Computer Science" highlights the need to support matrix multiplications in a fast, efficient, effective and easy-to-use fashion. We could not agree more but would like to add that Fourier Transforms and convolutions belong in the same category. Our coprocessors make matrix multiplications available without any need for the user or programmer to write matrix multiplication functions - they are provided with the SDK, like all other matrix and vector as well as scalar functions.
Following a LinkedIn High Performance & Super Computing Group discussion, "What's Next for GPU Chips? Maybe the Network.", a reader touched on parallelism with data interdependencies that occur in FEMs and FEAs, both of which are not handled well by GPCPUs and GPGPUs. Our coprocessors are explicitly designed to handle those datasets and algorithms optimally. Another reader (Keith Bierman) commented on my comment from earlier on that JVMs are not real processors and pointed out that there have been physical implementations of processors that execute Java natively. He also rightly pointed out that there are no constructs in Java that would allow a compiler or a processor to evaluate a parallel expression or a parallel dataset. That's correct, and as such, mapping parallel datasets or a parallel expression must be done by the compiler in conjunction with the processor (and thus requires the compiler to have intimate knowledge of the underlying hardware) by evaluating the software and the data sets used and by making assumptions about parallelism. So if parallelism is to be advanced in its deployment, then we'll have to design parallel processors (that's what SSRLabs is focusing on), write operating systems that make use of the underlying parallel hardware, design compilers for new parallel languages or extensions to existing languages that allow for programmers to give explicit instructions to the compiler to treat datasets or threads as parallel, and train programmers so that they identify parallelism in data and in tasks or threads. Only then will we be able to make use of parallelism in mainstream data processing. I expect that "Big Data" will help us get there, because without real parallelism and massively parallel processors most "Big Data" applications won't run fast enough to make a difference in business analytics. For HPC, OpenCL is already established and being used, and we plan on having an OpenCL API available with our coprocessors.
When looking at our logs, the number of unique visitors and the depth of the page views, I found an interesting bit of information. It looks like most visitors - currently over 22900 unique visitors per month - look at my blog first, and about a third of them then look into our product portfolio. They seem to find us through search engines with the term "massively parallel" most often (over 90%). It appears to me that there is a need for "massively parallel" coprocessors since that is the search term that gets us discovered. It indicates to me that at least some percentage of users of traditional processors are unhappy with the performance they see - why else would they look for alternatives?
I could not resist writing a comment on ARM's statement that a 64-bit processor is good enough for the time being. I agree - and here is why. A 64-bit address space already covers 2^64 bytes, which is 16 EiB and far more memory than anyone attaches to a single processor today. A 128-bit processor with an address space of 2^128 bytes (about 3.4 x 10^38 bytes) would multiply that by another factor of 2^64, or about 1.8 x 10^19. It is tempting to say that this would address every single particle in the universe, but that overshoots: even if we assume that dark matter and dark energy make up over 90% of the universe, and that whatever we see consists of about 10^79 particles, 2^128 addresses fall short of one address per particle by a factor of roughly 10^40. The point stands regardless, because no one is anywhere near building a memory of 2^128 bytes. So yes, a 64 bit address space is big enough for the time being!
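To get a feeling for the magnitudes involved, here is a minimal sketch that computes the size of a byte-addressable 64-bit and 128-bit address space exactly. The function name is made up for illustration, and byte addressing is an assumption.

```python
# Sizes of byte-addressable address spaces.
# Python integers have arbitrary precision, so these values are exact.

def address_space_bytes(bits: int) -> int:
    """Number of distinct byte addresses for an address width of `bits`."""
    return 2 ** bits

SPACE_64 = address_space_bytes(64)
SPACE_128 = address_space_bytes(128)
EIB = 2 ** 60  # one exbibyte

if __name__ == "__main__":
    print(f"64-bit:  {SPACE_64} bytes = {SPACE_64 // EIB} EiB")
    print(f"128-bit: about {SPACE_128:.2e} bytes")
    print(f"128-bit / 64-bit: about {SPACE_128 // SPACE_64:.2e}")
```

Going from 64 to 128 address bits does not double the address space; it squares it.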
Considering that Intel just made another announcement of yet another Xeon Phi (Knights Landing, due out in 2015) with yet another record-breaking performance I could not resist commenting on that one either. First, they combine a processor that has a TDP of >150W with stacked 3D DRAM on top of it. Second, its combined I/O and memory bandwidth is still only ~400 GByte/s. Third, they claim a performance of 3 TFLOPS. I'll address these things in sequence. If they stack a DRAM stack on top of a die dissipating 150 W or so, one or both of two things will happen: the DRAM R/W failure rate will go up, or the DRAM stack will lose connectivity and physically fail due to diverging thermal expansion. Most likely both will be observed. That's not good for HPC with process runtimes of multiple days. A combined I/O and memory bandwidth of 400 GByte/s is about two and a half times what today's Xeons have, and that's not good enough. 400 GByte/s is good enough to feed 25 billion times two operands of 64 bit floating-point numbers into the processor per second. That's not even counting instruction words. In other words, its performance is limited to 25 GFLOPS if one floating-point operation is carried out per cycle in which two operands come in and one goes out. As far as my math goes, 25 GFLOPS is a far cry from 3 TFLOPS. Not relying on Intel processors' math, I'd say that 25 GFLOPS is 1/120th of the claimed performance of 3 TFLOPS. When is the last time you bought something where the supplier got away with giving you about 0.8% of what they said they would? If you buy a 240 HP car, you want all 240 horses to be on board. Not 235, not 230, you want all 240. They give you 2.
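The arithmetic above can be written down as a short sketch. It follows the same simplifying assumptions as the text: every operation fetches two fresh 64-bit operands from memory, nothing is reused from registers or caches, and result write-back and instruction fetch are not counted. The function and constant names are made up for illustration.

```python
# Bandwidth-bound throughput estimate, following the reasoning above.
# Assumption: every operation pulls both operands over the memory interface.

def bandwidth_bound_flops(bandwidth_bytes_per_s: float,
                          operand_bytes: int = 8,
                          operands_in: int = 2) -> float:
    """Operations per second that the memory interface can keep fed."""
    return bandwidth_bytes_per_s / (operand_bytes * operands_in)

BANDWIDTH = 400e9      # 400 GByte/s, as quoted
CLAIMED_FLOPS = 3e12   # 3 TFLOPS, as quoted

SUSTAINED_FLOPS = bandwidth_bound_flops(BANDWIDTH)
FRACTION = SUSTAINED_FLOPS / CLAIMED_FLOPS

if __name__ == "__main__":
    print(f"bandwidth-bound: {SUSTAINED_FLOPS / 1e9:.0f} GFLOPS")
    print(f"fraction of claimed peak: {FRACTION:.2%}")
    print(f"horses left out of 240: {240 * FRACTION:.0f}")
```

Real codes reuse operands from registers and caches, so this is a worst case for streaming data with no locality; counting the result that has to be written back would lower the estimate further.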
I have been asked multiple times why "Big Data" requires new paradigms to work well. To answer that, I want to first explain what "Big Data" is. "Big Data" is usually understood to mean large amounts of data (and of course over time the definition of what big is will change!), unstructured data, and a large growth rate of that unstructured data. Systems that deal well with processing "Big Data" therefore need to support mass storage that is large and can be accessed with low latency and at a very high bandwidth. Usually, that rules out any traditional hard disks, even arrays of hard disks, and calls for arrays of SSDs, ideally SSDs that are not impacted by management traffic for internal maintenance. Hard disks - even 15K RPM SAS hard disks - simply are not fast enough. It also requires multiple servers connected through 10 Gigabit Ethernet (strictly speaking of course 10 Gigabit/s Ethernet) or better such as InfiniBand or 100 GbE to form a cluster, and here latency counts as well. Any time saved in communication for MPI (Message Passing Interface) and the like improves system performance drastically. NICs with offload engines for TCP and UDP help quite tremendously. Current processors don't fare well with processing unstructured data since by definition there is very little locality in the data. As a result, clusters that execute "Big Data" applications should use massively parallel coprocessors that have very large addressable random access memory, and they must be able to work well (i.e. they must not stall) in environments with very little locality of the data. In-memory compute is a requirement for good performance of a cluster processing any kind of "Big Data". DRAM alone may not be good (i.e. dense and energy-efficient) enough for that, and so solutions such as the Hybrid Memory Cube or similar memory architectures with stacked dice and Flash in the address space of the coprocessor are desirable. 
All of these constitute a new paradigm in compute, and existing processors cannot and do not fulfill those requirements. SSRLabs' coprocessors have been designed with "Big Data" in mind.
Some of the discussions that I had lately with software designers and programmers are a bit disconcerting. They seem to indicate that the use of high level languages has contributed to a loss of knowledge about the underlying processors. There seems to be an assumption that any problem can be solved with any processor, and that there is no real difference between processors, coprocessors and accelerators whatsoever. That is simply not true, since memory bandwidth, latency, contention, congestion management, scalability, inter-processor and inter-process communication and energy and instruction efficiency heavily depend on the underlying architecture of the processor. While it is nice to have an abstraction layer between programmers and the hardware that the software runs on, it is mandatory that programmers understand what implications exist. A Java Virtual Machine (JVM) is not a physical processor; it's a virtual machine. That means that a real physical processor executes the emulated instructions that a JVM requires, and maps them onto its own instruction set. Some instructions can only be mapped onto a processor in a sequential fashion, even if the problem is inherently parallel (and even if the language supports parallel expressions). Such a processor will execute native code and therefore C, C++, Java or Scala code substantially slower than a parallel machine using a parallel JVM or compiler with parallel C, C++, Java or Scala code.
To stay with the tradition of discussing controversial issues here I'd like to go back to the RISC-versus-CISC discussion. Those that have read my blog and understand Kloth's Observation may be able to predict what is about to come. First, I am glad the RISC-versus-CISC wars are over. Second, I am equally glad that the best of both worlds has come to dominate processor design. We now have processors that support large instruction sets (and not surprisingly, the largest instruction set can be found in a RISC processor), and some of those instructions point to hardware engines that execute certain functions as native and possibly atomic instructions. Those functions are complex mathematical operations that were unthinkable to be implemented even just a decade ago. FFTs, matrix multiplications and most video codecs are commonplace functions in processors and DSPs today. They used to be macros or function calls composed of hundreds of simple native processor instructions just a few years ago.
It looks like HPC is oftentimes understood as just scientific floating-point processing. I think that this definition is too narrow to capture reality. Both traditional HPC and "Big Data" applications require and produce large amounts of data, and that data needs to be processed in a short period of time. Processing that data is complex and computationally hard. I/O poses a challenge since the I/O rates vastly exceed what simple hard disk arrays can deliver. While not every "Big Data" application will be even close to the requirements that for example CERN poses, some will approach the size and complexity of a Zynga, Facebook or even Google - and that's HPC in my book, even if little to no floating-point algebra is involved. That is the reason why SSRLabs' coprocessors address both parts of HPC: floating-point math and finding structure in unstructured data.
"Moore, Roger Moore". No, I am not Roger Moore, but I hold it with Einstein, Albert Einstein. He gave us lots of good science and many pieces of good advice, and one of my favorite ones is "make it as simple as possible, but not any simpler than that". George Haber may have fallen victim to oversimplification. I'll take the liberty to rephrase his observation to "Any problem that can be described algorithmically, through table lookups or a combination thereof will be implemented as a piece of software unless it can be executed more efficiently in dedicated hardware". If that is what he means then I can wholeheartedly agree because that is a true statement. That statement also leads directly to Kloth's Observation. If more and more software is being written, then (atomic) instructions will have to cover more and more of the tasks executed to save power and energy, reduce latency, and improve performance. MMX and SSE as well as H.264 video codecs are only one set of examples. Kloth's Observation can be summarized as follows: "Any sequence of instructions that is proven to be deployed to a degree greater than a preset threshold in a typical application mix is replaced by dedicated hardware, controlled by a Finite State Machine dedicated to this sequence of instructions, and invoked by a new instruction word. The threshold is determined by goals for performance and latency as well as by the need to reduce power and energy consumption". If Haber's original Observation ("If it can be done in software it will") were true, then we'd be back to the original Turing Machine, where no hardware assist existed for anything. That would mean that our processors execute in, out, inc and bne (Branch on Not Equal). That's all that's required, but of course it's not practical. 
While we could build a processor like this today in very few gates and with today's SiGe or GaAs process technology probably in the range of 60 - 100 GHz, it would not be practical, it would not perform well, and its power consumption per (compound) instruction executed would be well above the threshold that anyone would accept today. Instead, we keep adding instructions and dedicated hardware to an instruction set of a processor, and sometimes we even need an entire new class of coprocessors for specific tasks because the memory bandwidth of a CPU might not be compatible with the needs of that coprocessor, and because there may be real-time requirements that for example a DSP can fulfill, and a CPU cannot. So while breadth of software is increasing, the same applies to hardware accelerators.
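The cost of doing everything with a handful of primitive instructions can be illustrated with a toy model. The sketch below adds two numbers using nothing but inc and bne and counts how many primitive instructions are executed. The program layout, the step accounting and all names are assumptions made for illustration; this is not a model of any real processor.

```python
# Toy illustration: addition built only from "inc" and "bne",
# compared with what a single dedicated "add" instruction would do.
# Program layout and step counts are assumptions for illustration.

def add_with_inc_bne(a: int, b: int) -> tuple:
    """Add non-negative b to a using only inc and bne.

    Returns (sum, number of primitive instructions executed).
    """
    if b < 0:
        raise ValueError("b must be non-negative in this sketch")
    counter = 0
    steps = 0
    while True:
        steps += 1        # bne counter, b: branch into the loop body
        if counter == b:
            break         # branch not taken, fall through to the end
        a += 1            # inc a
        counter += 1      # inc counter
        steps += 3        # two incs plus an always-taken bne back to the top
    return a, steps

if __name__ == "__main__":
    total, steps = add_with_inc_bne(3, 4)
    print(f"3 + 4 = {total} in {steps} primitive instructions")
    total, steps = add_with_inc_bne(0, 1000)
    print(f"0 + 1000 = {total} in {steps} primitive instructions")
```

In this toy model the instruction count grows linearly with the operand, whereas a dedicated add takes one instruction. That gap is what motivates replacing a frequently used sequence with dedicated hardware and a new instruction word.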
SSRLabs has been a Hybrid Memory Cube Consortium member for a good reason. It looks like the first HMC modules from Micron are going to hit alpha customers very soon. See Micron ships engineering samples of HMC modules. Large memory arrays with low-latency interconnects and very high bandwidths allow for efficient in-memory compute. This is not a novel concept, but for the first time it becomes feasible.
We all know that things come in waves. The new wave is apparently in-memory computing. Well, we have been advocating it for a while now, and that is in fact one of the reasons why we support very large memory arrays. Of course, that is not all. In-memory computing requires that processors and caches deal with data that has very little to no locality. Our coprocessor architecture is designed to work with that kind of data. Look for other processor suppliers and check what they do with data that shows no locality. If they show you cache hit rates to reassure you that their processors perform just fine, ask for the assumptions in the locality that went into those cache hit rates.
The issue of scaling keeps coming up, and many companies claim that their products scale linearly with the numbers of processors deployed. That is not the case. Linear scaling will only work on chipscale networks; any scaling exceeding chip or die boundaries will suffer from proportionally increased inter-chip communications latency. While this can be mitigated to some degree by a coarser granularity of the tasks that are offloaded to coprocessors, a coarser granularity also means that the number of tasks that can be offloaded or scheduled for processing is diminished. Thus, scaling really has more contributing factors than those that are currently covered by any competitor.
There was a great article in Computerworld on August 2nd, 2013. The headline read that Qualcomm calls eight-core processors 'dumb' and that rival MediaTek plans on releasing an eight-core processor in the fourth quarter (of 2013). I think that Michael Kan reported very well on this. Essentially, Anand Chandrasekher, who used to work at Intel, Microsoft and now is an SVP at Qualcomm, stated that the user experience relies on something other than multi-core processors. I am not sure what that would be, but he seems sure that he knows what it is. Qualcomm is a great company with great products, and they make cell phone processors with up to four cores. I am not sure if Anand thinks that a quad-core processor is all a cell phone needs, but I disagree with that assessment. We use our cell phones for many more things these days, and as a result, they need to have more compute power. One suggestion is to improve single-core performance; the other is to deploy more cores. More cores - even in a cell phone - allow for better granularity of tasks and threads to be scheduled, and therefore help save power at any given performance with a constant or better peak throughput in those applications. In cell phones, there are limits of course due to the limited number of threads we need to run, but it's not four or eight cores. I'd think that 64 would probably be a good number - even for superscalar processor cores. Anand's track record of correctly predicting trends is not all that good. He often got things wrong. This one, however, will be noted as one of the big false assumptions. I think that in general, broad all-encompassing statements usually are wrong - long-term anyway, but oftentimes even within the lifetime of the person stating it. A former CEO of IBM is said to have estimated the need for about 5 computers worldwide. I guess that number was slightly off. Also, Bill Gates is said to have stated that no one would ever need more than 640 kByte of DRAM. 
We know that this was wrong. With the Internet of Things arriving, we will see home gateways with enormous numbers of cores to control all of the sensors and actuators in our homes - and they all will be controlled and set up, maintained and supervised by our cell phones with high-resolution screens, high-speed WAN and WLAN connections, and lots of local and remote compute performance at their disposal. In other words, they'll use lots of processor cores.
In an earlier blog entry, I had said that sometimes it's not easy to figure out what to do in parallel processing, but if you can draw from your experience of what NOT to do, that sometimes helps by the simple laws of exclusion. If there are n possibilities for something to be done, and you have tried n-1 before, then there is only one left. Sure, that strictly applies only if all other circumstances are the same. However, most things surrounding your design issues don't change that drastically. Current conventional DRAM still is a limiting factor as it was 30 years ago. Disks are still slower than DRAM, even if you consider Flash-based disks. I/O tasks in general still take longer than compute. With that in mind, I have gone through a huge stack of designs that I was involved with, contributed to, or did as the sole designer. It may sound like nostalgia, but it is not. It is useful to be reminded what not to do. Here is the list, in no particular order; some done at my alma mater, CAU Kiel, others at any one of my companies. An 8088-based PC to an 80186 conversion: it was not worth it. Design of a server with coprocessors (earliest Intel idea of heterogeneous compute): 8086 with one 8087 FPU and two 8089 I/O Coprocessors. It worked well, was faster than most of the later 80286 machines, but proved expensive due to us having to write and modify device drivers to work with the 8089s. 80186-based "intelligent offload" Ethernet NIC with Intel 82586 allowed us to connect an 80286-based server to a Cray Y-MP without falling victim to frequent overruns (that was before TCP and UDP and other reasonable means of handshake). Our inmos T212 array with inmos C011 and later C012 Link Adapters to accelerate our HPC was less successful than we hoped for due to lack of floating-point performance on the T212. The same held true for its successor, the T414. After SGS Thomson bought inmos, the T800 was promised to cure it all. 
Well, it did not, and so we switched to Intel's 80860XR and XP. Lesson learned: did not scale well, and a GPGPU does not make for a good CPU or an accelerator. Then we tried true accelerators, such as Weitek's 3167 and 4167. They did accelerate the main CPU in some floating-point heavy applications (effectively only for scalars, which limits even today's coprocessors including GPGPUs), but by far not enough to make it worthwhile rewriting practically all of the software that we had in house. By this time, we had blown through a considerable amount of money, and the Dean ordered us to only build machines or clusters from commercially available hardware. Since no clustering software existed back then, performance dropped, and once that software started to appear, our physicists had to rewrite their software for analyzing star atmospheres anyway, so no real cost savings were to be had except that we only had to deal with software development expenses. We tried out and benchmarked a few Convex, Tandem, HP PA and x86, Sun SPARC, DEC Alpha and a few other machines, only to come back and create clusters all over again. We found that local ccNUMA on SMP machines with non-coherent global memory scaled best at the best performance per $. At some point in time, Intel re-invented the concept of Intelligent I/O which in the second go-around relied first on Intel's own i960 and then on the IOP 3XY series processors (which then were named XScale and in reality were the StrongARM product line acquired from DEC), and it actually did work well. Problem was that it was not x86, and so it withered away. The problem of the computational and particularly DRAM performance of the individual node remained. While we did no more hardware development, we were unsatisfied with the aggregate performance. That led me to leave and my company to resume ASIC development. The reason is simple: access to a local register is possible in one clock cycle. 
Access to a remote register on a chipscale array of processors is a matter of a few cycles. Access to anything remote that is not on-chip will cost hundreds, if not thousands of cycles both in terms of latency and in actual cycles a CPU must expend to facilitate that communication. The challenge is to provide a coprocessor that facilitates that acceleration, and to make sure that the effort to rewrite software is minimized. That is exactly what SSRLabs is doing, so I came back full circle.
We have set up our Facebook page just to make sure that those who don't use more traditional search engines and those that don't use a portal can find us easily as well. We noticed that searches for coprocessors or accelerators for numerically intensive and "Big Data" applications to instill some structure into unstructured data do not seem to lend themselves to proper search engine rankings, so the Facebook page may help. Let us know if that works for you.
We have continued our ASIC, firmware and software (mostly API) development and noticed that CUDA seems to be diminishing in importance and OpenCL is gaining in popularity. We are not quite sure why it took this long, since OpenCL always was more flexible and offered better support for heterogeneous compute, but CUDA got all the press. That seems to have changed, and we are glad to see it did.
Jack Dongarra has published preliminary results of the current top of the top500 supercomputers, the Tianhe-2 in China (more here: Jack Dongarra's original paper in PDF, Wikipedia and PHYS.org reporting and commenting on Jack's paper). I think that this might spell the end of GPGPU compute - which I personally would appreciate. Intel's Xeon Phi decimated the GPGPU-accelerated machines by a wide margin. The Xeon Phi handily beats them both in absolute performance and in GFLOPS/W. Heterogeneous compute will replace the current GPGPU-accelerated paradigm.
I thought some more about the implications that SSRLabs' Technology Advisory Board discussion has, and I came to the conclusion that we have not even started to see the benefits that SSRLabs has as a provider of coprocessors versus the suppliers of host processors. A host processor needs to support a variety of Operating Systems, accessories, input devices, output devices, file systems, device drivers, APIs, User Interfaces and data structures. I am sure I omitted a whole laundry list of items here, but all of them directly influence the development effort, both in hardware as well as in software. That is not the case for a coprocessor, since all it needs to support is an API on a host, and its internal Operating System, memory management and task scheduling. These few items have no substantial impact on development time and complexity, and as such, a coprocessor can be much more efficient. The coprocessor can use the host as an I/O front end, and as such does not need to have any knowledge of the host's Operating System, file I/O, networking stack or even the host's CPU type.
SSRLabs' Technology Advisory Board had a virtual e-meet and discussed a whole string of interesting topics with regards to parallel processing. Topics discussed were memory access, bisection bandwidth, instruction decoders, MIMD and SIMD, and whether saving power necessarily means saving transistors. As usual, I learned a lot from my TAB members, and I think we are on a good trajectory to convince the industry that what we are doing is the correct strategy to allow for scaling. The discussion also highlighted that there is a profound difference in a general-purpose host processor and a coprocessor. Luckily, coprocessors are a lot simpler, because they don't need to span the breadth of requirements that host processors need to span. That makes them much easier to design, to verify and to deploy.
It looks as if the funding situation for semiconductor companies improves, particularly for those companies that can show that they don't need $300M to get to the finish line. There is a pretty good guest commentary on VentureBeat from Rob Ashwell. The title is "Will trend for low cost chip development bring back the investors?", and it was posted on May 11.
I again received a lot of emails asking for clarification of my latest entry. Most questions revolved around hardware versus software. The explanation is actually quite simple again. A processor can't have an instruction or a hardware accelerator for every imaginable computational problem. In fact, the advantage of a general-purpose processor is that it is flexible such that it can execute a variety of different computational problems with only a very few instructions. That is extremely useful because it allows anyone to test out methods such as video and audio encoding and decoding to check if an idea or application is useful for the end user. If it is and it is adopted widely, then it makes sense for processor designers to gradually and over time implement these algorithms within hardware-accelerated coprocessors to augment the host CPU. Some of these coprocessor functions may over time migrate into the host CPU because they augment the instruction set. However, that would be adding complexity and size to the host processor. Due to better efficiency of those algorithms being implemented in hardware, they save power over a software implementation in the system - even though they may increase power consumption of a component or add a new component that was not there before. That is a decision the host processor designers will face when implementing incremental hardware into their CPUs. Sometimes, it pays off to do so, and sometimes it does not, and the incremental functions will stay part of a coprocessor. Some algorithms can easily be implemented in hardware, but their implementation is so different from the needs of the host CPU that they can't be incorporated into the host CPU. An example for that is the graphics card. Neural net and floating-point coprocessors are another set of examples.
Mentor's User2User Conference on 4/25/2013 in San Jose was quite a confirmation of our views. Wally Rhines' keynote speech captured many of the issues that we have stated over the past year. Hardware and software are rarely co-designed, and even if they are, the system architect only rarely looks at how to save power by executing often-repeated instruction sequences in dedicated (incremental) hardware. A simple example is Steve Jobs' reluctance to support Adobe Flash: Adobe's Flash is not standards-based, and as such no dedicated hardware exists. Therefore, it has to be executed in software on the host processor. That takes more energy, and battery capacity is drained faster. Steve Jobs did not want to accept that and pushed for standards-based H.264, for which ample power-efficient hardware accelerators exist. As a result, Apple's iPhone does not decode Flash-encoded videos in software, but happily plays H.264-encoded videos. Wally Rhines' keynote called for better integration of hardware, software, firmware and system design - with the appropriate tools, of course. We go one step further and postulate that for a large class of problems, a hardware implementation is faster, more energy-efficient and in general comes closer to what users expect. SSRLabs' coprocessors are targeted at offloading from host CPUs those tasks that can be executed faster and more power-efficiently than on the host CPU itself.
I have read a bunch of data sheets for competing floating-point coprocessors, and I must say, I find them not very believable. If a used car salesman tries to sell you a car and tells you that the engine of that car provides 1000 HP and that the car goes 180 MPH, and that at idle it gets 30 miles per gallon, would you believe it? Particularly if you see that the only drive wheel looks like it's stolen off a bicycle? Probably not. On top of that, at idle the car goes nowhere, and so it will use up fuel without going a single mile... But that's what I see often. Why? Compare the claimed number of FLOPS with the bandwidth from and to DRAM. Let's take a simple yet unrealistic example to prove the case. Let's assume the processor has a set of 1000 superscalar Fused-Multiply-Add (FMA) engines that each deliver one multiplication and one addition result per clock cycle at 500 MHz (resulting in 1 TFLOPS). So we get one result of the type d := a * b + c every clock cycle per engine. We feed in four pieces of data (including the instruction) and get one out. Let's further assume that all those data pieces are of the 64-bit floating-point type, and that the instruction is encoded in an 8-bit instruction code. So to feed this beast, we need to provide three 64-bit words and one 8-bit instruction per engine per cycle, and we must write back one 64-bit result per engine per cycle. That is an asymmetric DRAM bandwidth of 200 bits in and 64 bits out per engine per cycle. Since each cycle delivers two floating-point operations (the mult and the add), that is 200 bits in and 64 bits out per 2 floating-point operations, or 100 bits in and 32 bits out per floating-point operation. Let's now look at the claims that are often made: 1000 GFLOPS or 1 TFLOPS per processor at 300 W, and a memory bandwidth of below 200 GBytes/s. 
If our example holds true, then that processor must have a memory bandwidth of 100000 Gbit/s into the chip and 32000 Gbit/s out of it. Expressed in Byte/s that is 12500 GByte/s into the chip and 4000 GByte/s out of the chip. Most GPGPUs have between 1500 and 2400 GPGPU cores to achieve that performance, and the cores need to compete for the DRAM bandwidth. Something is not right here. The above numbers can only be right if the algorithm that is worked on is more complex, such that not every result gets put back into memory, and not every operand has to be fetched from memory either. It even means that instruction fetches at that rate are impossible, so it clearly restricts the results to anything that can be solved in a SIMD (Single Instruction, Multiple Data) machine. While there are many data-parallel problems that can be solved in a SIMD engine, I do not see the processor suppliers state this as a restriction. They rather claim that the performance refers to peak numbers, and that real-life numbers are "close" to those peak numbers. As we have shown above, that is not the case. Real-life numbers will be a small fraction of the claimed performance. Even experienced parallel programmers usually don't manage to write a SIMD version of a matrix multiplication. That is why SSRLabs' coprocessors are different. Our memory bandwidth is not anemic compared to our SIMD and MIMD parallel performance. We have balanced the coprocessor subsystem to make sure that simple data-intensive algorithms executed in a SIMD fashion as well as complex MIMD task-parallelism won't overload any part of the coprocessor or the memory subsystem.
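The arithmetic of this example can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: all constants are the ones assumed in the text (1000 FMA engines, 500 MHz, 64-bit operands, 8-bit instruction codes), the function name is made up for illustration, and 200 GByte/s is taken as the upper end of the bandwidth typically quoted.

```python
# Back-of-the-envelope check of the FMA example above.
# All figures are the assumptions made in the text, not measurements.

ENGINES = 1000        # FMA engines on the chip
CLOCK_HZ = 500e6      # 500 MHz
FLOPS_PER_FMA = 2     # one multiply plus one add
OPERAND_BITS = 64     # a, b, c and d are 64-bit floats
OPCODE_BITS = 8       # instruction encoding
CLAIMED_GBYTE_S = 200 # upper end of the bandwidth quoted in data sheets


def required_bandwidth_gbyte_s():
    """Return (peak GFLOPS, GByte/s in, GByte/s out) if every FMA touches DRAM."""
    bits_in_per_fma = 3 * OPERAND_BITS + OPCODE_BITS  # 200 bits
    bits_out_per_fma = OPERAND_BITS                   # 64 bits
    fma_per_second = ENGINES * CLOCK_HZ
    gflops = fma_per_second * FLOPS_PER_FMA / 1e9
    gbyte_in = fma_per_second * bits_in_per_fma / 8 / 1e9
    gbyte_out = fma_per_second * bits_out_per_fma / 8 / 1e9
    return gflops, gbyte_in, gbyte_out


gflops, gbyte_in, gbyte_out = required_bandwidth_gbyte_s()
print(f"peak performance: {gflops:.0f} GFLOPS")
print(f"bandwidth needed: {gbyte_in:.0f} GByte/s in, {gbyte_out:.0f} GByte/s out")
print(f"needed vs. claimed: {(gbyte_in + gbyte_out) / CLAIMED_GBYTE_S:.1f}x")
```

Changing the constants shows how quickly the gap grows with the number of engines, and how much data reuse an algorithm needs before the peak number becomes reachable.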
EE Times has caught up with the Hybrid Memory Cube Consortium and written up a pretty good article on why the semiconductor industry needs new memory subsystems. While the article does not quite go far enough, it makes the case that we have been making for close to two years, and for which we have a solution. We use HMC memory modules for our coprocessors, but our advantage does not end there.
Moore's observation keeps getting misinterpreted. Gordon Moore originally observed that, due to improvements in fabrication processes, the number of transistors that can be placed on a die doubles every 24 months. Since chip design was less expensive than chip fabrication, that observation meant that every two years the cost per transistor was cut in half. That's purely an economic view of the trajectory the semiconductor industry was on. There is no physical law that says that transistors have to keep getting smaller. In fact, there is a limit to the size of a transistor. But that's beside the point. Moore's Law (or better: observation) - if applied to the cost of a transistor - stopped scaling long ago. With the design cost of a chip ballooning, the fabrication cost is no longer the determining factor, as can be seen if we compare DRAM costs with the costs of an ASIC. As of today (DRAMXChange) a 4 Gbit DRAM costs $3.45. So over 4 billion transistors and over 4 billion capacitors (or a total of about 9 billion components) resell for not even $4. The largest FPGA with less than 7 billion components costs substantially more than $1000. Why? Design engineering costs dominate ASIC cost. Fabrication is a small portion of the total cost. If Moore's Law were correct and applied universally, then the DRAM should cost about the same as the FPGA, and it does not. Thus, Moore's Law is not universally applicable, and it stopped being economically true a long time ago. Silicon Valley and the semiconductor industry still thrive. More than Moore is only possible if we look at processor performance and re-evaluate things properly. Traditional CPUs don't scale. SSRLabs' coprocessors scale.
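To put a number on the DRAM versus FPGA comparison, here is a small sketch that computes the price per billion components. It uses only the figures quoted above; the FPGA values are the bounds given in the text ($1000 as a lower bound on price, 7 billion as an upper bound on components), so the resulting ratio is a lower bound as well. The function name is made up for illustration.

```python
# Price per billion components, using the figures quoted in the text.

DRAM_PRICE_USD = 3.45
DRAM_BITS = 4 * 2**30              # 4 Gbit
DRAM_COMPONENTS = 2 * DRAM_BITS    # one transistor plus one capacitor per bit

FPGA_PRICE_USD = 1000.0            # the text says "substantially more than"
FPGA_COMPONENTS = 7e9              # the text says "less than"


def usd_per_billion_components(price_usd, components):
    """Price of one billion components at the given chip price."""
    return price_usd / components * 1e9


dram = usd_per_billion_components(DRAM_PRICE_USD, DRAM_COMPONENTS)
fpga = usd_per_billion_components(FPGA_PRICE_USD, FPGA_COMPONENTS)

print(f"DRAM: ${dram:.2f} per billion components")
print(f"FPGA: at least ${fpga:.2f} per billion components")
print(f"FPGA components cost at least {fpga / dram:.0f} times as much")
```

Even with the most favorable bounds for the FPGA, a component on the FPGA costs a few hundred times what it costs on the DRAM, and fabrication alone cannot explain a gap of that size.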
The current discussion of "More than Moore" is very entertaining, to say the least. It revolves around the notion that Moore's Law at some point in time will end, and that point in time may be now, or in a few years. We are at a point at which we can squeeze several billion transistors onto a reasonably-sized die. I'd rather suggest doing something useful with these transistors than trying to figure out how to keep cramming more and more of them onto a die, and then after the fact complain that the resulting chip consumes too much power and does not scale in performance. I'll revisit this in more detail in a later entry.
With regards to single-core versus multi-core and heat dissipation the solution is actually quite simple. On current traditional single-core and multi-core processors, we have observed that between 25% and 75% of all transistors (and therefore area and power consumption) are in the Cache. Caches are used to mask the latency and bandwidth differences between the processor core and the external DRAM. In applications such as supercomputers and HPC or large-scale clusters, a lot of messages are exchanged through MPI, using external memory (DRAM). That is completely unnecessary if all cores are small and are located on one single die, with direct interconnects and a large bisection bandwidth. As a result, most power consumed in many-core (or massively parallel) processors goes into productive tasks, versus most of the power within single-core processors being consumed in Caches that are unnecessary if we just look at MPI. SSRLabs proposes massively parallel processors with large on-die bisection bandwidths.
As promised, here are a few more answers. The first pertains to the x86-64. I refer to Intel's 64-bit x86 architecture as x86-64 instead of x64 as many others do. Intel has had many 64-bit architectures over time, Itanium being one of them, and of course the infamous i80860. Since x86-64 is a derivative of the original 16-bit x86 architecture that over time has been extended to support 32-bit addressing and now 64-bit address schemes, I think that the AMD-developed 64-bit extensions should be referred to as x86-64. x64 is plain misleading. I like to be precise; if Intel chooses to be sloppy, then that is their problem. The second answer is for the instruction-efficiency and power-efficiency question, and what those terms mean. Power-efficiency does not necessarily mean low power consumption. Power-efficiency is the quotient of performance (MIPS or GFLOPS) and power consumed. So if a processor consumes 100W but delivers 1 TFLOPS of 64-bit performance, it is more power-efficient than a processor consuming 5W and delivering 20 GFLOPS of 32-bit math. Those numbers are published if the processors in question were evaluated with LINPACK or any similar tests for I/O or AI applications. Instruction-efficiency is more difficult to explain. For coprocessors, it usually does not make sense to send a single multiply operation from the host to the coprocessor, since the time it takes to send the data and instructions off and receive the results back may be longer than the time it would take for the host to execute those same instructions. Therefore, one would farm out longer sequences of instructions, such as a large matrix multiplication or an inversion, a Fourier Transform, or a set of trigonometric functions. While this can and should be done using OpenCL or similar APIs, it is crucially important that the coprocessor executes these functions with as few native instructions as possible so that SIMD and MIMD (I'll explain those later...) 
can be executed simultaneously on the coprocessor. It is also nice to know that legacy code can be executed with simple translation tables. Another item came up reading an article about HMCC FLASH on EE Times. One more time, our insight based on our long-term vision has proven to be correct. The Hybrid Memory Cube Consortium laid the foundation for this. HMCC memory will allow for even more layers of tiered memory; we will have extremely large FLASH-based HMCC memory on top of the faster, but smaller HMCC DRAM. I'll explain later and in more detail in my book why this is important.
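Both terms from this answer can be made concrete with a short sketch. The first function is just the quotient defined above, applied to the two processors from the example (it ignores the difference between 64-bit and 32-bit math). The second is a deliberately simple break-even model for offloading; its name, its parameters and all the numbers fed into it (host and coprocessor speed, link bandwidth, latency) are hypothetical values chosen for illustration, not data for any real part.

```python
def gflops_per_watt(gflops, watts):
    """Power-efficiency: performance divided by power consumed."""
    return gflops / watts


def offload_pays_off(flops, bytes_moved, host_gflops, copro_gflops,
                     link_gbyte_s, latency_s):
    """True if sending the work to the coprocessor beats running it on the host.

    Simplified model: coprocessor time = fixed latency + transfer + compute.
    """
    host_time = flops / (host_gflops * 1e9)
    copro_time = (latency_s
                  + bytes_moved / (link_gbyte_s * 1e9)
                  + flops / (copro_gflops * 1e9))
    return copro_time < host_time


# The two processors from the text: 1 TFLOPS at 100 W versus 20 GFLOPS at 5 W.
print(gflops_per_watt(1000.0, 100.0), "GFLOPS/W versus",
      gflops_per_watt(20.0, 5.0), "GFLOPS/W")

# Hypothetical system: 10 GFLOPS host, 1000 GFLOPS coprocessor,
# 8 GByte/s link, 1 microsecond round-trip latency.
system = dict(host_gflops=10.0, copro_gflops=1000.0,
              link_gbyte_s=8.0, latency_s=1e-6)

# One multiply: two 64-bit operands out, one 64-bit result back.
single_multiply = offload_pays_off(flops=1, bytes_moved=24, **system)

# One 2048 x 2048 matrix multiplication: 2*n^3 operations, three n x n matrices.
n = 2048
matrix_multiply = offload_pays_off(flops=2 * n**3, bytes_moved=3 * n * n * 8,
                                   **system)

print("offload a single multiply:", single_multiply)
print("offload a matrix multiply:", matrix_multiply)
```

Under these assumptions the single multiply is not worth sending across the link, while the matrix multiplication is, which is the reason for farming out longer instruction sequences.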
I again received a whole lot of good questions (see below). I will answer them over the next few days, starting with the power density question today. Power density is the power consumed and heat created per volumetric unit. In other words, it is a measure of how much heat is generated within a certain volume, that is, a footprint times its height. If your Intel Core i7 processor dissipates 150W and it has a footprint of 4" * 4" and is 6" high, then it dissipates 150W in a volume of about 10cm * 10cm * 15cm or 1500cm3, or it has a specific power density of 150W/1500cm3 = 0.1W/cm3. In reality, the power density is much higher since the die is only 20 mm * 20 mm * 1 mm in volume; the additional volume comes from the need to spread out the heat and give it more surface area. If we assume that this is a constant for air-cooled processors, then the only solution we have to improve the performance per volumetric unit is to improve the ratio of MIPS/W and GFLOPS/W. The power density is important because data center operators need to be able to remove heat and still use off-the-shelf components. In addition to that, they usually like to stick with air-cooling. So far, not a lot of data center operators have dared to switch to water-cooled computers.
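The two power densities in this answer are easy to recompute. The sketch below uses the dimensions given above (the package rounded to 10 cm * 10 cm * 15 cm, the die at 20 mm * 20 mm * 1 mm); the function name is made up for illustration.

```python
def power_density_w_per_cm3(watts, length_cm, width_cm, height_cm):
    """Power dissipated per cubic centimeter of the given box."""
    return watts / (length_cm * width_cm * height_cm)


# Processor including heat sink: 150 W in roughly 10 cm x 10 cm x 15 cm.
package = power_density_w_per_cm3(150.0, 10.0, 10.0, 15.0)

# The bare die: the same 150 W in 2 cm x 2 cm x 0.1 cm.
die = power_density_w_per_cm3(150.0, 2.0, 2.0, 0.1)

print(f"package: {package:.1f} W/cm3")
print(f"die:     {die:.0f} W/cm3")
print(f"the die is {die / package:.0f} times denser than the package")
```

The heat sink exists to bridge that gap: the die runs at several hundred W/cm3, and the cooler spreads the same power over a volume a few thousand times larger.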
Today I want to comment on an article from February's ACM magazine ("Power Challenges May End the Multicore Era", Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger, Communications of the ACM, 2/2013, Vol. 56 No 2, p. 93ff) that was picked up and commented on by Jack Ganssle ("Will Moore's Law doom multicore?") on Embedded.com. The authors surmise in their article that multi-core processors are not a solution to the thermal issues, but a problem. All underlying assumptions can be doubted. The first fatal mistake of this paper is to assume that the cores in a multi-core processor are created from x86-64-like processor architectures, and that need not be the case. The second mistake is to assume that the power density in a processor core designed to be deployed in an MPP is anywhere close to the power density of an x86-64-class processor. The third mistake is the presumed conclusion that a single-core CPU would do better. It is my firm opinion (and I will prove that in my upcoming book) that all of these assumptions are incorrect. An MPP will need to consist of smaller cores, with better interconnects, alleviating the need for large caches. That solves the problems of power, power density, heat and the difficulty of verifying the correctness of the design. In fact, if we consider that today 6 billion transistors fit onto a decent-sized die in any modern process, then it is hard to believe that the authors indeed think that a single-core processor could potentially be a solution. We believe that Moore's Law implies that MPPs are the solution: Each core is easy to design and to verify, the interconnects become the fabric that ties it all together, and the specifics of the bisection bandwidths make the processor fit the computational problem. That's exactly what SSRLabs is doing.
- Anonymous 1: What is power density? Why is the power density important?
- Anonymous 2: Why do you always refer to instruction-efficiency and power-efficiency? What do those terms mean?
- Anonymous 3: What's an x86-64?
- Anonymous 4: I don't quite see why multi-core is any better or worse than single-core. Why is the article so wrong? It seems like heat in general is the enemy.
It looks like I am going to have market updates for the blog today. We have a strong confirmation of our market, both in terms of size and in the timeliness of our design stage. EETimes has had a few articles out that confirmed that the transition from Xeon to ARM is happening in servers on a massive scale. This is partly due to Intel's Atom not quite living up to expectations as a Xeon replacement. Another reason is that a lot of data center operators just don't need more than light processing, and for that the ARM architecture is good enough at substantially lower power consumption. Intel's perennial competitor, AMD, has decided to shift its focus away from x86-64 towards ARM processors for servers. With the market (the TAM) projected to triple, there is space for AMD, Calxeda and others for the ARM-based host processors, and more than enough space for us for the coprocessors.
I received quite a few more questions pertaining to our software (APIs and SDKs) and hardware solutions. First, the issue of the national plan to map the brain came up. Our neural net coprocessor is the perfect fit for that since it incorporates many features that the brain has. Therefore, modeling and emulating the brain will be possible. Second, a few emails popped up asking me when the processors will be available to the general public, and if they'd work with current host processors and current GPGPUs. The coprocessors are in the late stage of design, and they are not yet available for sale. They will work with any server that has a PCIe Gen3 slot, and they can work in parallel with any GPGPU and GPCPU. That's what we and the Computer Science community call heterogeneous computing.
A few more questions came up with regard to Amdahl's Law and Gustafson's Law of parallel speedup, and how they pertain to the granularity of the problem. I will write a book covering how these are all related to each other, and how even Moore's Law ties in and enables linear performance growth.
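Until then, here are the two laws in their usual textbook form, as a short sketch for readers who have not seen them side by side. The parallel fraction of 0.95 and the core counts are arbitrary values picked for illustration; this is not the analysis the book will contain.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: fixed problem size, the serial part limits the speedup."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def gustafson_speedup(parallel_fraction, cores):
    """Gustafson's Law: the problem grows with the machine (scaled speedup).

    Here parallel_fraction is measured on the parallel machine.
    """
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * cores


# Illustrative values only: 95% of the work is parallel.
for cores in (1, 16, 256, 4096):
    print(f"{cores:5d} cores: "
          f"Amdahl {amdahl_speedup(0.95, cores):7.2f}x, "
          f"Gustafson {gustafson_speedup(0.95, cores):8.2f}x")
```

With a fixed problem size the speedup can never exceed 1 / (1 - 0.95) = 20, no matter how many cores are added, whereas the scaled speedup keeps growing almost linearly with the number of cores.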
I am going to answer a few questions that came up lately and that I have not answered yet. First, how does the GeekDot page relate to us? Well, the answer is actually quite simple. For someone to come up with a new idea and a new solution to an old problem, that person needs to be aware of the old problems and old solutions - whether they worked or not. I have played with and used just about every processor and microcontroller that is out there: RISC, CISC, single-core, multi-core, even DSPs in all of their guises. I have used centralized systems such as mainframes, client/server and fully distributed client-only systems. I have used and created hardwired logic, processors, FPGAs and PALs/GALs, and even Ferranti's Uncommitted Logic Arrays (ULAs). I think I know what's required, and how to solve it. That's how it ties in. As a physicist by training I know how to optimize a mathematical algorithm for fewer register toggles and fewer intermediate steps, and that gives it better performance and a lower power and energy consumption. Coprocessors of that kind will accelerate floating-point operations at lower power and energy consumption than current architectures, so you can get more performance out of your data center, even if you can't feed in more power or remove any more heat.
Coming back to some of the earlier comments made, I'd like to offer some more explanation. With regard to Excel being fast and precise enough, that might be the case for the casual user and smaller sets of tables. Excel might even use packed BCD (Binary Coded Decimals) - I am not sure, and this is really not what this is about. I know that banks don't trust most floating-point units, and that is usually for a good reason. They rely on IBM's POWER6/7 families for packed BCD that's reasonably fast. Assuming everyone in the US has one account on average, all banks together would need to go through 300 million multiply-adds per day if every account has one transaction on average per day. Easy. However, if I want to model the weather for a local or global forecast or if I want to run a simulation of the effect of a certain financial model, I'd like to have 300 M multiply-adds per second available to me to run many models and to weed out the ones that lose money. So I need precision and accuracy and speed and power-efficiency at the same time. That should not be done in the host's main CPU; I'd like that to be done in a coprocessor so it does not get disturbed by other "stuff". And that's exactly what we target.
I have received a few emails stating that current processors do well enough in floating-point applications. To some degree, and for some applications, that is correct. Most DSPs (Digital Signal Processors) support 32-bit integer or 32-bit floating-point arithmetic or a combination thereof - and for the applications they run, that's good enough. Most scientific applications (weather forecast, car crash simulations etc.) require better precision, particularly if matrix multiplication is involved. The required dynamic range is much higher than what can be mapped onto 32-bit integers or 32-bit floating-point. Some financial applications and, most importantly, nearly all financial modeling require substantially better precision. Not clear? I will give you a simple example. Let's assume we have a processor that supports 32-bit integer math with so-called two's complement. Then it can compute numbers from -2147483648 through 0 to +2147483647 with a resolution of 1 (the quantization). If we need to be able to process single pennies, then the numbers go from -$21,474,836.48 through 0 to +$21,474,836.47 with a penny as the smallest possible increment or decrement. So the maximum range is about $21M in positive and negative direction. If your account shows more than $21,474,836.47 or less than -$21,474,836.48, the value no longer fits and gets truncated (a more elegant word for cut off). That's not good. Most banks deal with much more money on a daily basis, so they'll need much better precision. After all, the last thing you want to hear from your bank is that your account is empty because it fell victim to a rounding error or truncation. That's why most x86-class processors, the Itanium, IBM's POWER processor series and SPARC and Niagara support much larger dynamic ranges of numbers than 32-bit integer or floating-point allow for. IBM's POWER supports theoretically unlimited precision with packed BCD, and the others support arithmetic with 64-bit or 80-bit floating-point precision. 
However, they usually only support multiply and add (and subsequently subtraction). High-precision and fast division, logarithms and exponential functions as well as any kind of root (square root etc.) are not well-supported - and those are the functions you would need to compute compound interest. Would you like your mortgage to go up due to a limited-precision floating-point unit when you refinance? I sure would not! Compound interest is very susceptible to small errors, and that is why it is so important to get that right. It becomes obvious if you write it out for your own mortgage (or any other interest-bearing account): N(T + 30 years) = N0 * (1 + R)^30, where R is your annual interest rate (5% meaning R = 0.05), N0 is your initial investment (or the initially borrowed money of your mortgage), and N(T + 30 years) is the end result after 30 years. The same way the interest compounds, the error compounds. Our processor design enables multiplication, addition, subtraction, division, all exponential and all logarithmic functions in hardware.
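Both points, the limited range of 32-bit integers and the compounding of rounding errors, can be tried out in a few lines. This is a sketch under stated assumptions: the principal of $300,000 and the rate of 4.5% are made-up values, and 32-bit floating-point is emulated by rounding every intermediate result with Python's struct module. It does not model the arithmetic unit of any particular processor.

```python
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def pennies_to_dollars(pennies):
    """Format an integer number of pennies as a dollar amount."""
    sign = "-" if pennies < 0 else ""
    dollars, cents = divmod(abs(pennies), 100)
    return f"{sign}${dollars:,}.{cents:02d}"


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def compound(n0, rate, years, rounding=lambda x: x):
    """N0 * (1 + R)^years by repeated multiplication, rounding each step."""
    balance = rounding(n0)
    factor = rounding(1.0 + rate)
    for _ in range(years):
        balance = rounding(balance * factor)
    return balance


# Range of a 32-bit two's complement account balance held in pennies.
print("smallest balance:", pennies_to_dollars(INT32_MIN))
print("largest balance: ", pennies_to_dollars(INT32_MAX))

# Hypothetical mortgage: $300,000 at 4.5% per year, compounded for 30 years.
exact = compound(300_000.0, 0.045, 30)
single = compound(300_000.0, 0.045, 30, rounding=to_float32)
print(f"64-bit result: {exact:,.2f}")
print(f"32-bit result: {single:,.2f}")
print(f"difference:    {single - exact:+,.2f}")
```

The first two lines reproduce the roughly $21M limit from the example; the last three show how far a 32-bit result drifts from the 64-bit one after only 30 compounding steps.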
- Anonymous 1: What seems to be the problem? I don't see an issue with precision or accuracy. Excel does not seem to create any faulty results.
AMD under Rory Read has understood the signs of the time. It is time to get past the x86. While I don't particularly like the ARM processor architecture (I will explain in a later blog entry why that is the case), one can safely assume that mobile devices, tablets, the successors to laptops and possibly desktops are going to be powered by ARM processors. Given that Calxeda and others have had success in building even servers from ARM processors, it is not a stretch of the imagination to assume that within a very short period of time the only remaining big iron will be IBM's POWER, and x86 will decline if not disappear. That leaves a void in servers, particularly in those used for High Performance Computing: Even ARM coupled with an OpenCL or CUDA capable GPGPU won't be able to provide the level of high-precision floating-point performance that will be needed. This is where SSRLabs can fill the void: dedicated high-precision high-performance floating-point at levels of power consumption that a general-purpose CPU cannot deliver. Two recent articles point to this transition that is already in progress. One is Brian Fuller's excellent 10/19/2012 piece on "Silicon Valley Nation: AMD bids farewell to PC", the other article is a press release by Reuters that was picked up amongst others by JEDEC and SmartBriefs, "AMD hires chip veterans, diversifies beyond PCs".
- Anonymous 1: As far as I can tell, most modern General Purpose Processors provide enough floating-point performance. What is the problem?
- Anonymous 2: We need more floating-point performance, but we can't afford to get more power into the building and we can't get any more heat out. What do we do?
Rick Merritt at EE Times on 1/18/2013 wrote this great article "Facebook covets core-heavy ARM SoCs". It is an important message to understand. Facebook and a lot of other data center operators see one common trend today: Data is being stored and only lightly processed, and if it is processed, floating-point operations are rarely executed. Throwing a CPU such as an Intel Xeon at the problem is complete overkill; the Xeon processor family and other heavyweights such as Itanium or IBM's POWER7 are intended to execute floating-point operations at a decent rate, but at the cost of area and power. If all that floating-point silicon area is unused, it is going to waste in applications such as data storage and light integer processing. As a result, Facebook wants to disaggregate servers into those components that have similar half-lives. CPUs have proven to be updated about once a year, so the CPU should be swappable. And for their application profile, it's good enough if it supports long integers only; there is no need for massive floating-point units. That fits right into SSRLabs' basic assumptions. The host CPU should rather orchestrate work, and if there is a special need, just throw the problem at a coprocessor (preferably a local ccNUMA-type design). We are glad that Facebook and the Open Compute Project shared their work with the general public. We will build upon it.