| S | D | N  ···   THEMES.ORG · NEWSFORGE · GEOCRAWLER.COM My OSDN · PARTNERS · CONFERENCES  
Open Logo

 
 
Current Content Advertising Info Our Sponsors About Us Site Map/FAQ Contact Us

=>

COVER FOCUS
SMP's place in the Linux equation
Cover Focus

By: Franco Vitaliano

The new Linux kernel raises expectations for scalability-always circuitous.


Whoops! There go the year-end bonuses: "It's possible that some of our sales force overstated the benefits of dual-capable CPU systems. Unfortunately, by being overly critical of single-CPU-capable systems, if they have to eat some crow, so be it, but we do believe this (single-CPU) system has value." This rather embarrassing admission from an "I-don't-want-to-be-named" executive of a prominent PC-workstation OEM followed Intel's revelation that the first Pentium 4 chips would include circuitry to simplify their use in multiple CPU configurations. So short of an OEM effort of Homeric proportions, the first P4 CPUs will be limited to uniprocessor configurations.

For some time, PC OEMs have differentiated their high-end, high-margin workstations from desktop PCs by asserting that heavy-lifting duties demanded dual-chip symmetric multiprocessing (SMP) solutions. Unfortunately for our anonymous PC OEM executive, Intel will not be able to produce an SMP-compatible P4 CPU until the second half of 2001, or possibly even later. Nonetheless, some wags attribute this delay to an Intel Xeon-marketecture gambit to keep SMP-capable CPU prices artificially high. For now, the market spin is that the chip's 1.5-GHz clock speed and 400-MHz system bus make the P4 so powerful that a second CPU in a workstation is unnecessary.

Meanwhile, all this P4 posturing will keep many crafty geeks busily chuckling as they merrily construct screaming home-brew SMP workstations using motherboards that support dual Intel Celeron CPUs. They've discovered that the new "A" series Celeron CPU sports an on-die Level 2 cache, which was not part of the original Celeron design and which elevates the lowly Celeron into a solid platform for doing Linux SMP chores.

More importantly, these P4 machinations open the door for rival AMD, which has finally unveiled its multiprocessor plans. AMD's SMP DDR system architecture provides both two-way and four-way building blocks using the 266-MHz Athlon bus and the new AMD-760 MP chipset. Beyond providing for SMP configurations, the 760 MP chipset supports ATA100 disk drives and DDR SDRAM. With OpenBench Labs tests demonstrating that AMD processors on average perform 20% faster than their Intel equivalents, this is great timing for doing a torrid SMP samba with the new Linux 2.4 kernel.
DOUBLING DOWN ON THE CHIPS

It's not HAL, but chip multiprocessing (CMP) may prove to be the first hot technology of 2001.

Taking multiprocessing a step further down the evolutionary path is the hot new chip multiprocessing (CMP) technology.

The fundamental breakthrough for CMP is in puttingmultiple processor cores on a single die while still leaving real estate for adequate instruction and data caches. As a result, the new CMP systems should exhibit very fast cache-coherency communications between processor cores on the silicon.

While CMP is still in the labs at Compaq, HP, IBM, and Sun, IBM design disclosures reveal that its Power4 architecture will sport up to four processors, along with second- and third-level RAM caches. Early Power4 chips will use bus speeds of 500 MHz, which will double before the end of 2001. Technologically, this is all very hot stuff, but for bottom-line-challenged PC manufacturers, it's the economics of CMP that really sizzles.

To design an SMP system for two or more processors that will share memory, PC manufacturers can take advantage of some specialized circuitry in Intel's CPU that already simplifies matters for the operating system. These design facilities include hardware cache coherency and built-in interprocessor-interrupt handling. In addition, Intel's Multi-Processor Specification (MPS) 1.1/1.4 defines a detailed configuration structure in ROM that the boot-up processor can read to find the full configuration of the processors and buses. This document also specifies how shared memory must function in software.

For the PC OEM, decisions about shared-memory architecture are not only complex, but also very costly in terms of performance. Far and away, the most common memory model is that of Uniform Memory Access (UMA) across all of the processors. A few have dabbled with Non-Uniform Memory Access (NUMA), which assigns banks of memory to certain CPU groups and forces a CPU not in the group to generate an interrupt to access memory out of its main bank. A lot of costly extra design work and circuitry are needed just to offer an SMP option. In turn, that overhead raises manufacturing costs and pressures price-sensitive buyers to drop their high-margin workstation dreams and go shopping for low-margin desktop PCs.

Not surprisingly, a lot of attention is being focused on the consumer side of the new Linux kernel. This gaze has centered on support for the SuperH processors used in Windows CE devices; Universal Serial Bus (USB) devices including keyboards, mice, and speakers; 32-bit PC Cards (PCMCIA); along with support for ISA plug-and-play devices. Nonetheless, a significant number of changes in Linux 2.4 can be described as enterprise-level.

These additions are not exclusively related to improved SMP performance; however, many improvements certainly relate to this aspect of enterprise computing. For example, the Linux 2.4 kernel now supports 64 GB of RAM on Intel hardware, which provides more than enough shared memory for SMP scalability. The way Linux handles shared memory has also been changed to be more standards-compliant and to simplify matters for developers. Under Linux 2.4, creating shared-memory segments is eased via the mounting of a special "shared-memory" virtual file system. This kernel feature will be handled by the various Linux distributions.

The scheduler has been revised to be more efficient with a larger number of concurrent processes. Therefore, as a multi-user server, Linux 2.4 should prove to be more scalable on multiprocessor systems. [For more about multiprocessing, see accompanying sidebar, "Doubling Down on the Chips."]

In addition, facilities have been provided to systems administrators to configure the process limit constrained only by the amount of memory in the system. Reports tout high-end servers running Linux 2.4 as able to support as many as 16,000 processes with as little as 512 MB of RAM installed. In theory, the new Linux kernel can handle about 4.2 billion users in its 64-GB memory space.

Just in case those 4.2 billion users need to access really big files, the 2-GB file size restriction has also been lifted. Given all of these high-end changes, it should come as no surprise that the entire RAID subsystem would be rewritten as well. On SMP systems, the RAID subsystem is better threaded to offer much improved-and sorely needed-throughput performance. In addition, a logical volume manager (LVM) has been added to dynamically add, remove, and reallocate disk space without bringing the system down.

Getting to those resources over the network has also improved. Linux 2.4's TCP/IP stack has undergone a complete rewrite. Linux stack performance has now gone way up and the stack has also been made far more reentrant. Prior to 2.4, Linux serialized many TCP/IP activities, hurting performance and limiting scalability on multiprocessors.

With all this SMP activity underway, you can expect a lot of vendor hype and confusion. The biggest danger in all of the ensuing PR fallout is that "scalability" and "speed" may end up equivalent in some users' minds. These two terms are emphatically not the same: Scalability refers to the effort necessary to find a solution as the difficulty of the problem increases.

The problem that both SMP and clustering first attempted to solve was how to scale an operating system as the number of interactive, terminal-based users increased. Assuming that systems designers got the memory access and cache-coherency problems solved correctly-which is a monumentally huge assumption-load balancing of end-user applications via SMP presented IT organizations with a transparent solution: There was no need to change either existing software or programming constructs. IT could treat an SMP system as if it were a uniprocessor and never worry about optimizing individual applications.

While this was an important solution for systems like VMS in the mid-1980s, the problem does not map well onto today's computing environment. In the 1980s, the constraints were the high cost of CPUs coupled with their low power. Following the launch of the VAX 11/780 CPU, it took six years to design a CMOS chip with similar processing power. The introduction of the MicroVAX II in 1985, however, set the clock ticking on Moore's Law-microprocessor power doubles every 18 months. Fifteen years later, mainstream CPUs sport 210 times the power of the original VAX; the P4 will expand that out to a factor of 2,048 times.

The constraint on today's IT decision-makers is finding a way to deliver high availability (HA). The goal no longer is to cram as many users as possible onto a single box to save money, but rather to keep an application available continuously in the 24/7 Web-enabled e-business world.

"No single point of failure" is the new IT mantra. As a result, HA clusters that balance loads over multiple servers running the same limited application set are sprouting up. Indeed, Yahoo, arguably one of the single most successful Web sites, uses lots of cheap Intel uniprocessor boxes running FreeBSD to reliably serve their many millions of users. For SMP to make sense in this new environment, individual applications must scale on an SMP system.

What made timesharing eminently successful on SMP machines was the fact that many-tens and possibly hundreds-of independent user applications were running simultaneously. It was then up to the operating system to load-balance these applications over all of the available CPUs. The analog to this for today's environment is an application that can be broken down into a number of independent simultaneous tasks.

Database systems fit this profile well and therefore have the potential to scale on SMP systems: The issue is whether or not the database has been architected with SMP scalability in mind. Calculation-intense applications rooted in matrix algebra are another application class that scales well on SMP systems. Once strictly the domain of NASA or FermiLab, these numerical techniques are making their way into econometric business-analysis models and On Line Analytical Processing (OLAP). For the programmer, the trick is to treat matrix calculations as parallel-vector operations and not as sequential scalar calculations.

All of the foregoing is just another way of saying that high-end desktop workstations will represent a significant source of demand for SMP systems. More importantly, the implication for effectively exploiting any SMP architecture in a workstation is that a lot of cranial cells will have to be exercised to make programs SMP-aware. From the outset, sophisticated parallel-processing algorithms must be developed. Implementing these algorithms in code is no less complex as program design constructs must be employed. [For an arresting update of cranial efforts toward algorithm designs and problem-solving, see related sidebar, "Easy Solutions the Hard Way," p. 48.]

With algorithms attempting to decompose a problem into numerous simultaneous tasks, programmers will have to be expert in the use of threads and shared memory as a means for two or more processes to share a segment of memory to communicate via the sharing of all ("shared everything") or some ("shared something") data structures.

To share information between heavyweight processes-including child processes spawned using fork()-shared memory is an essential construct. Using shared memory involves uniquely naming the segment, specifying access permissions using the traditional read/write/execute scheme, and dealing with race conditions. When multiple CPUs in an SMP box are executing the same code out of the same shared memory, memory locks are necessary to guarantee exclusive access to sections of the code that must be executed atomically-that is, as an uninterrupted unit. The SMP lock/unlock mechanism is actually an implementation of semaphores, with appropriate queuing of requests when a resource is busy.

Another mechanism for breaking processes into discrete tasks is through the use of threads. These lightweight processes have been defined in Linux by Linus Torvalds as a "context of execution." While there are user-level and kernel-level threads, only kernel-level threads can be used to take advantage of SMP. In particular, each task gets a table of threads and the kernel schedules each thread within the time-slice of each process. If properly coded-as when you deal with shared memory, the issues of deadlocks and race conditions are an exercise left for the programmer-a threaded application can automatically take advantage of an SMP environment and run incrementally faster with each added CPU.

Linux application developers typically use a POSIX Pthreads package to enable its SMP shared-memory capabilities. Unfortunately, the POSIX API doesn't require that threads truly run in parallel on your nifty multi-CPU Linux hardware and older C libraries are not thread-safe. Newer Linux distributions, however, include LinuxThreads, a pthread library made by Xavier Leroy, which is integrated with GNU LibC (glibc2).

As a result of these advances in the Linux kernel and GNU LibC, SMP has well and truly arrived. With a little bit of luck, ordinary users around the world will soon be rejoicing in the multithreaded chorus.
EASY SOLUTIONS THE HARD WAY

For hardcore computer scientists, the questions are: "What can be solved?" and "When can you solve it?"

Two disciplines in computer science-dubbed computability theory and computational-complexity theory-have sprung up, devoted to the understanding of which problems can be solved on a computer with finite resources (memory) in a reasonable amount of time as the scope of the problem grows. Another way of stating this is that some problems are hard or intractable because an algorithm with a finite number of instructions cannot be discovered that guarantees finding a solution in a reasonable amount of time.

Given a description of a particular problem, a number of generic questions arise around how to solve that problem:

  • Can an algorithm be designed to solve the problem?
  • Is it possible to design an efficient algorithm?
  • Can it be determined if the algorithm is optimal?

Computability theory is concerned with identifying the class of intractable problems for which no effective algorithm exists. Computational-complexity theory is concerned with identifying solvable problems that are intractable because no efficient algorithm exists. Out of this study, a classification method was devised to address the issue of tractability.

Pspace is defined as the class of problems for which answers can be found with computer resources-i.e., memory space to store instruction steps-that are polynomial in the size of the input. In other words, the amount of memory needed to solve a problem of size n can be expressed as a polynomial function of nk for some k =0, which is denoted as O(nk). No constraints are placed on run time: The only constraints are on computer resources.

Problems for which answers can be verified by an algorithm whose run time can be expressed as a polynomial in the size of the input are called NP. NP is a subset of Pspace; it is thought that NP is a proper subset-all NP problems reside in Pspace. The interesting aspect of this definition of NP is that it requires only that a problem's claimed answer can be either verified or refuted quickly. There is no requirement-or even an implication-that the answer can be found quickly.

Problems for which answers can be found by an algorithm whose run time is polynomial in the size of the input are called P. P is a proper subset of NP-all problems that can be solved in polynomial time can have their answers verified in polynomial time.

The most interesting subset of P is the class of problems that can be solved by fast algorithms whose run time is O(log kn). This subset of P is called Log Space, also known by its nickname of NC. These problems scale well and you can expect significant improvements in execution time as you add more processors.

Problems that do not have a fast algorithm for solution are called P-Complete. The most difficult problems, however, are those that are NP-Complete. These are NP problems for which there are no known algorithms for solving the problem that run in polynomial time. Moreover, all known algorithms for solving the problem run in exponential time-i.e., run time is O(kn). NP-Complete problems are intractable. You could add as many CPUs as there are stars in the universe and the solution still won't run properly.

Although this is a rich area of study, getting definitive answers to what is tractable (NC) and what is intractable (NP-Complete) has proven to be enormously difficult.



<  | Last Rage  >

 
Free Subscriptions Free Subscriptions!

OpenBench Labs OpenBench Labs
COMING SOON!
Free Subscriptions Subscriptions
Special Offer!
FREE 12 month subscription!



Use the Eclipse light to save your eyes after 6-hours of coding (or was it Quake?). Either way, this glare-free light will change the way you look at your monitor.
www.thinkgeek.com



CURRENT ISSUEADVERTISING INFOOUR SPONSORS
ABOUT USSITE MAP / FAQCONTACT US

Copyright ©2001 Custom Communications. All rights reserved.