************************************************************************
* Myricom GM networking software and documentation                     *
* Copyright (c) 2001, 2002 by Myricom, Inc.                            *
* All rights reserved.  See the file `COPYING' for copyright notice.   *
************************************************************************

README-linux for gm-1.6.3

README for Linux distribution

Supported platforms:
   Linux 2.2 and 2.4 for IA32, PowerPC, Alpha.
   Linux 2.4 for IA64 (Itanium).

   - For Alphas, if you have 2 GB or more of memory, we recommend
     kernel version 2.4.18 to install GM. You must use kernel
     version 2.4.14 or later (2.4.9 also works).

Supported interfaces:
   LANai7 (PCI64, PCI64A), and LANai9 (PCI64B, PCI64C)

If you have LANai4, you will need to upgrade your interface, or use a
previous version of GM such as gm-1.2.3 for 256K and gm-1.5.2 for larger
memory sizes. (Please also note that Linux 2.4 is not supported on
gm-1.2.3.) For installation instructions of an earlier GM version,
please refer to the respective README and README- files.

WARNING: When building/linking GM applications, you must do so on a
Linux machine that matches the OS version of the machine on which you
will be running. You cannot compile on a 2.2.x machine and run the
executable on a 2.4.x machine.

Table of Contents:
-----------------
   I. GM Installation
        1. Configuring and compiling GM
        2. Installing the GM driver
        3. Running the GM Mapper
        4. Testing the GM installation
  II. Verifying the GM Performance
 III. Running IP over GM
  IV. Improving IP Performance
   V. Fork() Support
  VI. Sample Scripts to automatically load GM and start the Mapper
 VII. Operating-system-specific Caveats
        a. Using Compaq Compilers for Alpha Linux (ccc cxx)
        b. PCI Chipset Tweaks
        c. APIC IRQ conflict on Tyan and AMD motherboards
        d. AGP (nVidia and ATI) conflicts
VIII. Miscellaneous
        a. Uninstallation of the GM driver

************************************************************************
If you encounter difficulties, please consult the FAQ at
   http://www.myri.com/scs/GM_FAQ.html
Direct all technical support questions to help@myri.com.
************************************************************************

===================
I. GM Installation
===================

GM installation is performed in the following four steps.

1. Configuring and compiling GM:
---------------------------------------------

   gunzip -c gm-1.6.3_Linux.tar.gz | tar xvf -
   cd {GM_HOME}
   ./configure
   make

By default, we assume that the header files for your Linux kernel are
located in /usr/src/linux. If your Linux kernel source is not located
in /usr/src/linux, you must configure with the following option:

   ./configure --with-linux=

where the value given to --with-linux specifies the directory for the
Linux kernel source.

The kernel header files MUST match the running kernel exactly: not only
must they be from the same version, but they must also contain the same
kernel configuration options.

Note: If you have a mixture of hosts with LANai4 and LANai7 (or LANai9)
interfaces that need to talk to each other, you must configure with
--disable-new-features on all of the hosts.

For a complete listing of all options to configure, type:

   ./configure --help

Note: Do not use the configure flag --enable-directcopy. This flag is
not a valid option to GM 1.6.3. It will be re-enabled in a future
release.

2. Installing the GM driver:
---------------------------------------------

Select an installation directory path. It is usually best for this path
to be an NFS directory available on all machines that are to share this
GM installation. The directory must be accessible using this path on
all machines that are to share the installation. The path must be
absolute; it must start with "/". However, it may contain symbolic
links.

   cd binary
   ./GM_INSTALL

You may omit the installation path to install the driver in /opt/gm/.
Next, you must run

   su root
   /sbin/gm_install_drivers

on each machine to install the drivers on that machine.

If you want the driver to load automatically at boot, you must create
appropriate links in the /etc/rcN directories to the /etc/init.d/gm and
/etc/init.d/myri scripts. Alternatively, you may start and stop the
drivers manually using

   su root
   /etc/init.d/gm start
   /etc/init.d/gm stop
or
   su root
   /etc/init.d/gm restart

to start, stop, or restart the driver, respectively.

For directions on how to uninstall the GM driver, refer to the
"Miscellaneous" section.

Note: If the host is rebooted, you must reload the GM driver (and rerun
the GM mapper). There are sample scripts, contributed by a customer, in
{GM_HOME}/drivers/linux/scripts for loading GM and running the mapper
at reboot.

3. Running the GM Mapper
------------------------

Myrinet is a source-routed network. That is, each host must know the
route to all other hosts through the switching fabric. The GM mapper
automatically discovers all of the hosts connected to the Myrinet
network, computes a set of deadlock-free minimum-length routes between
the hosts, and distributes appropriate routes to each host on the
connected network.

Loopback and point-to-point network topologies require that
gm_simpleroute be run instead of the GM Mapper. (Refer to the GM README
and the FAQ for details.)

For a switched network topology, the GM Mapper must be run before any
communication over Myrinet can be initiated. Further technical details
about the GM mapper can be found in mt/README.

Depending upon the user's needs, there are three different ways in
which the GM mapper may be used.

MAP_ONCE mapping:
----------------

The first way is by far the most common, and we shall refer to it as
"map_once". In this method, the mapper is run on one host in the
network (any of the hosts). It is rerun if a host (re)boots, if a
hostname is changed, or after a change of Myrinet topology (swapping of
ports on a switch).
(If the Mapper must be rerun for any of these reasons, it is strongly
advised to run it on the same host.)

The command for this method of running the GM mapper is:

   cd {GM_HOME}/binary/sbin/
   su root
   ./mapper ../etc/gm/map_once.args

STATIC mapping:
--------------

The second way in which the GM mapper may be used is called "static
mapping" or "file mapping". In this method, an active mapper is run
once when ALL of the hosts are up and running the GM driver. This
initial active mapper generates a map file and a host file. These files
are then copied to all of the hosts in the network, or shared by NFS.
An entry in the boot scripts allows each host to read the map file and
the host file and update the routing table on its local Myrinet
interface(s). This method is particularly appealing because no human
intervention is needed and no traffic is generated at boot time.

The commands for this method of running the GM mapper are:

   cd {GM_HOME}/binary/sbin/
   su root
   ./mapper ../etc/gm/static.args

If the gm tree is not mounted by NFS, copy the three files created by
this command (static.map, static.routes, and static.hosts) to the
{GM_HOME}/binary/sbin/ directory on each host.

Add the following command to the boot scripts of each host (scripts in
/etc/init.d or /etc/rc.d/init.d).

   cd {GM_HOME}/binary/sbin/
   su root
   ./file_mapper ../etc/gm/file.args

HA mapping:
-----------

The third way in which the GM mapper may be used is for users who need
High Availability (HA) in an aggressive computing environment.

The command for this method of running the GM Mapper is:

   cd {GM_HOME}/binary/sbin/
   su root
   ./mapper ../etc/gm/active.args &

This runs the GM mapper continuously in the background to detect and
add any new hosts or remove any non-responding hosts, to detect any
change of topology (change of slots in the switch, change of
innerswitch topology), and to periodically update the routing tables of
the Myrinet cards (by default, every 30 seconds).
Note that this mapping method is quite intrusive. Users are strongly
advised to avoid this method of running the GM mapper if their
applications produce heavy network traffic (e.g., MPI applications).
The GM Mapper uses unreliable messages that may be dropped under heavy
contention, which can lead to hosts being marked as "non-responding"
and removed because they appear unreachable. A few expert customers use
this mapping method to satisfy their high availability constraints for
GM applications designed to handle a dynamic change of configuration
(by design, MPI is NOT a fault-tolerant application).

For the majority of users, the "map_once" GM mapping method is
sufficient. For users with more production-level constraints, "static
mapping" is the most suitable method. For fault-tolerant GM
applications, the third method provides the best alternative.

4. Testing the GM Installation
------------------------------

A variety of tests are available in {GM_HOME}/binary/bin to test your
GM installation. A README describing each of these tests can be found
in {GM_HOME}/tests/README.

We recommend the following five tests to validate your installation.

   cd {GM_HOME}/binary/bin

1. Test that the Mapper has correctly detected all of the hosts in your
Myrinet network by typing the following command on several of the
hosts:

   ./gm_board_info

Note: In the output of this command, all hosts should be listed in the
routing table of each node. If not all of the hosts are listed, then it
is possible that a cable is not connected, or GM is not properly loaded
on all hosts in the Myrinet network. A green LED should be lit on the
switch for each connection that is active.

If you see

   *** No routes found ***

in the output, this is an indication that the GM Mapper has not been
run. (See README- for details.)

When ./gm_board_info successfully reports a list of hosts, you can then
run ./gm_allsize and ./gm_stress to test the network.

2. Test the basic connectivity of GM by typing:

   ./gm_allsize --verbose --geometric

on one of the hosts in the Myrinet network.

Note: This loopback test will NOT work in a point-to-point (no switch)
configuration.

3. Test GM bandwidth between two hosts. On the first host, type

   ./gm_allsize --slave --size=15

and then type the following command on the second host

   ./gm_allsize --unidirectional --bandwidth --remote-host= \
       --size=15 --geometric

where the value given to --remote-host is the name of the first host.

These one-way tests are performed by running in slave mode on one
machine and in master mode on the node to be tested. This is done by
adding '--slave' on the command line of the slave machine and '-h'
followed by the name of the machine running in slave mode on the
command line of the master. The name of each host is as specified in
the output of ./gm_board_info.

The --size parameter indicates the maximum length of message that will
be sent, where 2^{size} is the value of that length. In this example,
the maximum length of message sent is 2^{15} = 32768 bytes (32K). The
--geometric parameter reduces the number of message lengths that will
be tested. The default for gm_allsize is to test every length from 1 to
2^{size}, incrementing one byte at a time. These tests take a long time
to run, and generate data files suitable for input to gnuplot.

4. Test GM latency between two hosts. On the first host, type

   ./gm_allsize --slave --size=15

and then type the following command on the second host

   ./gm_allsize --bidirectional --latency --remote-host= \
       --size=15 --geometric

where the value given to --remote-host is the name of the first host.

These tests are performed by running in slave mode on one machine and
in master mode on the node to be tested. This is done by adding
'--slave' on the command line of the slave machine and '-h' followed by
the name of the machine running in slave mode on the command line of
the master. The name of each host is as specified in the output of
./gm_board_info.
The --size parameter indicates the maximum length of message that will
be sent, where 2^{size} is the value of that length. In this example,
the maximum length of message sent is 2^{15} = 32768 bytes (32K). The
--geometric parameter reduces the number of message lengths that will
be tested. The default for gm_allsize is to test every length from 1 to
2^{size}, incrementing one byte at a time. These tests take a long time
to run, and generate data files suitable for input to gnuplot.

5. Run gm_stress on every host in the cluster to validate GM. Complete
details on running gm_stress can be found in the FAQ.

   http://www.myri.com/scs/faq/faq-html#debug-stress

This gm_stress command must be run simultaneously on each host, using
the same list of host names in each case. It can be run on any subset
of hosts on the network.

For a list of all possible runtime options for these commands, issue
the command with --help as the runtime option, e.g. ./gm_debug --help.

================================
II. Verifying the GM Performance
================================

We recommend the following test to verify the GM performance.

View the results of the hardware benchmark test of the PCI bus with the
DMA engine of the Myrinet adapter.

   cd {GM_HOME}/binary/bin
   ./gm_debug --no-counters

Note: The output of this command gives the maximum sustained bandwidth
that can be obtained from the PCI bus.

Refer to the section entitled "GM Performance" in the {GM_HOME}/README
for complete details on expected GM performance.

=======================
III. Running IP over GM
=======================

The Linux command to enable IP over GM is as follows:

   /sbin/ifconfig myri0 up

where you must replace 'myri0' with the appropriate name (myri1, myri2,
etc.) if you have more than one Myrinet interface per host.

For more information, please refer to the FAQ
(http://www.myri.com/scs/GM_FAQ.html).

============================
IV. Improving IP Performance
============================

To get good IP performance over Myrinet:

   * use Linux-2.4 (Linux-2.4.19 is now available)
   * configure GM with --enable-new-features (a default for gm-1.5 and
     later) to get a larger 9000-byte MTU for IP-over-Myrinet

You definitely want to use Linux 2.4 instead of Linux 2.2, and NFS-v3
over TCP. Linux 2.4 has vastly better TCP/IP and UDP/IP numbers than
Linux-2.2. Also, there have been some recent patches to Linux-2.4 that
help UDP performance.

If you are running Linux 2.2 or earlier, you should use the following
tuning options to get good NFS bandwidth. Otherwise, you are latency
dominated, and Myrinet IP and Ethernet IP performance will be about the
same.

   - Increase the TCP windows:

        echo "262144" > /proc/sys/net/core/rmem_max
        echo "262144" > /proc/sys/net/core/wmem_max
        echo "262144" > /proc/sys/net/core/wmem_default
        echo "262144" > /proc/sys/net/core/rmem_default

   - In linux/include/net/tcp.h, replace the value of

        #define MAX_WINDOW 32767

     with the value of your choice (200k~500k might be good).

   - Check that /proc/sys/net/ipv4/tcp_window_scaling is enabled with
     the value 1 (as it should be by default).

   - Experiment with the buffer sizes of netperf or your favorite
     network tester.

Note: These tuning options are not required for Linux 2.4.

==================
V. Fork() Support
==================

As of gm-1.5.2, GM has full support for fork() under Linux. It works
for all processor families. There are no restrictions; GM can fork()
with or without a GM port open. However, if you have a choice between
vfork() and fork(), vfork() will perform better, since the time to fork
a process with vfork() is much shorter.

================================================================
VI. Sample Scripts to automatically load GM and start the Mapper
================================================================

The directory {GM_HOME}/share contains some sample initialization
scripts, contributed by customers, that can be customized to suit your
system to automatically load the gm driver and start the GM Mapper.

=======================================
VII. Operating-system-specific Caveats
=======================================

---------------------------------------------------
a. Using Compaq Compilers for Alpha Linux (ccc cxx)
---------------------------------------------------

Under the C shell:

   setenv CC ccc
   setenv CXX cxx
   setenv CXXFLAGS \
     "-g -O2 -inline speed -x cxx -noexceptions -nocxxstd -using_std -w2"
   setenv CFLAGS -gcc_messages
   setenv KCC gcc
   rm -f config.cache
   ./configure

or under a Bourne shell or Bash:

   CC=ccc ; export CC
   CXX=cxx ; export CXX
   CXXFLAGS="-g -O2 -inline speed -x cxx -noexceptions -nocxxstd"
   CXXFLAGS="${CXXFLAGS} -using_std -w2" ; export CXXFLAGS
   CFLAGS=-gcc_messages ; export CFLAGS
   KCC=gcc ; export KCC
   rm -f config.cache
   ./configure

---------------------
b. PCI Chipset Tweaks
---------------------

In the file:

   {GM_HOME}/drivers/linux/gm/gm_arch.c

If you have an i840 chipset, modify the flag to be

   #define GM_INTEL_840 1

There are similar defines for:

   #define GM_INTEL_860 1
   #define GM_21154 1
   #define GM_INTEL_450NX 1
   #define GM_KT266A 1

Also from this file, please read this warning:

/****************** PCI CHIPSET TWEAKS: WARNING *************************
 *                                                                      *
 * The patches below were supplied by customers who reported that       *
 * their PCI performance was improved when using these patches          *
 * on a particular chipset.                                             *
 * These patches tweak certain bits in the chipset and have not been    *
 * verified or reviewed by Myricom and may have other, possibly         *
 * negative, side-effects. Before applying one of these patches,        *
 * you may wish to check for a newer BIOS for your machine.             *
 * Also, a newer linux kernel may provide better PCI performance,       *
 * and might be a safer course of action than applying one of           *
 * these patches.                                                       *
 *                                                                      *
 * Use these patches at your own risk.                                  *
 *                                                                      *
 ***********************************************************************/

--------------------------------------------------
c. APIC IRQ conflict on Tyan and AMD motherboards
--------------------------------------------------

We have encountered APIC IRQ conflicts on several Tyan and AMD
motherboards. The installation of GM will fail with an error message
similar to the following:

   GM: LANai rate set to 198 MHz (max=2-2MHz)
   GM: Board 0 page hash cache has 32768
   GM: Allocated IRQ 11
   GM: NOTICE: GM: board interrupt (configured on IRQ 11) is not working
   GM: NOTICE: GM: Failed to initialize Myrinet Card
   GM: gm: driver unloading
   GM: WARNING: GM: No Board Initialized
   #############################
   Error Installing GM driver module
   #############################

or

   GM: Version 1.5.2.1_Linux build 1.5.2.1_Linux xxxh@xxx.xx.xx Fri Jul 19 14:03:17 EDT 2002
   GM: NOTICE: GM: Module not compiled from a real kernel build source tree
   GM: This build might not be supported.
   GM: Highmem memory configuration:
   GM: PAGE_ZERO=0x0, HIGH_MEM=0x3ff80000, KERNEL_HIGH_MEM=0x38000000
   GM: Memory available for registration: 224748 pages (877 MBytes)
   GM: MCP for unit 0: L9 4K (new features)
   GM: LANai rate set to 133 MHz (max = 134 MHz)
   GM: Board 0 page hash cache has 32768 bins.
   GM: Allocated IRQ5
   GM: NOTICE: GM: Board interrupt (configured on IRQ 5) is not working.
   GM: NOTICE: GM: Failed to initialize Myrinet Card
   GM: gm: driver unloading

The IRQ error message means that the driver asked the Myrinet NIC to
raise the interrupt assigned by the BIOS, to check that it is working,
and did not receive it within the expected timeout. Thus, the driver
cannot use the Myrinet board and aborts the initialization.
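As a rough aid for spotting this failure, the following sketch pulls the
IRQ number out of the "is not working" notices shown in the sample logs
above. It assumes the notice text matches those two samples; the
function name gm_failed_irqs is hypothetical and is not part of GM.

```shell
#!/bin/sh
# Sketch only: gm_failed_irqs is a hypothetical helper, not a GM tool.
# It reads kernel log text on stdin and prints the IRQ number from each
# "board interrupt (configured on IRQ N) is not working" notice.
gm_failed_irqs() {
    sed -n 's/.*[Bb]oard interrupt (configured on IRQ \([0-9][0-9]*\)) is not working.*/\1/p'
}

# Example using two lines copied from the first sample log above.
printf '%s\n' \
    'GM: Allocated IRQ 11' \
    'GM: NOTICE: GM: board interrupt (configured on IRQ 11) is not working' \
    | gm_failed_irqs
# prints: 11
```

On a live system the input would typically come from the kernel log,
for example "dmesg | gm_failed_irqs". No output means no such notice
was found in the text that was scanned.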
The most frequent cause for this problem is:

   * The interrupt lines are managed by an APIC (Advanced Programmable
     Interrupt Controller) chipset that is not supported correctly by
     the BIOS and/or by the current Linux kernel.

Possible solutions:

   1. Try a different PCI slot.
   2. Upgrade the BIOS.
   3. Upgrade the Linux kernel version if available.
   4. Boot the Linux kernel without APIC support; pass the flag -noapic
      to the booting kernel via the LILO boot prompt. In this case, the
      kernel will use a safer compatibility mode. It is important to
      note that if this error occurs on any node in the cluster, all
      nodes in the cluster should be booted with -noapic.

Refer to the Myrinet FAQ for further details.

   http://www.myri.com/scs/faq/faq-install.html#install6b

---------------------------------
d. AGP (nVidia and ATI) conflicts
---------------------------------

Two types of problems were reported.

1. Loading the GM module first and then the nVidia or ATI module works.
But if the nVidia or ATI module is loaded first, GM will not load.

The GM_INSTALL error message looks like:

   n03 135# ./GM_INSTALL
   Making device files in /dev.
   ifconfig myri0 down - in case it was up
   myri0: unknown interface: No such device
   Adding new GM driver.
   sbin/gm: init_module: No such device
   Hint: insmod errors can be caused by incorrect module parameters,
   including invalid IO or IRQ parameters
   **** Error installing GM driver module. ****

and then in the kernel log, you see something like:

   GM: Version 1.5.2_Linux build 1.5.2_Linux x@x Wed Aug 21 16:17:08 PDT 2002
   GM: NOTICE: GM: Module not compiled from a real kernel build source tree
   GM: This build might not be supported.
   GM: Highmem memory configuration:
   GM: PAGE_ZERO=0x0, HIGH_MEM=0x7fff0000, KERNEL_HIGH_MEM=0x38000000
   GM: Memory available for registration: 451752 pages (1764 MBytes)
   GM: NOTICE: GM: pci_rev2: Could NOT map board into kernel (span = 0x1000000)
   GM: WARNING: GM: Can't map IO memory to system memory
   GM: NOTICE: GM: gm_instance_init failed
   GM: NOTICE: GM: Failed to initialize Myrinet Card
   GM: gm: driver unloading
   GM: WARNING: GM: No board initialized

This is a shortage of virtual memory (used for I/O-mapping PCI memory)
in the Linux kernel. On configurations with a lot of physical memory,
Linux reserves only 128 MB of address space for dynamically allocated
virtual memory. Unfortunately, the nVidia card seems to consume as much
virtual memory as it can (it occupies at least 128 MB in PCI memory
space), so if you load it before the gm module on such a configuration,
you will get the error shown above.

The fix, for machines with more than 768 MB of memory and an nVidia or
ATI card, is to apply the following patch to the kernel:

--- arch/i386/kernel/setup.c Thu Aug 2 17:00:46 2001
+++ arch/i386/kernel/setup.c.2 Thu Oct 11 09:00:59 2001
@@ -815,7 +815,7 @@
 /*
  * 128MB for vmalloc and initrd
  */
-#define VMALLOC_RESERVE (unsigned long)(128 << 20)
+#define VMALLOC_RESERVE (unsigned long)(256 << 20)
 #define MAXMEM (unsigned long)(-PAGE_OFFSET-VMALLOC_RESERVE)
 #define MAXMEM_PFN PFN_DOWN(MAXMEM)
 #define MAX_NONPAE_PFN (1 << 20)

Also be sure that the HIGHMEM option is enabled when configuring the
kernel.

If you do not mind losing memory, or just want to run a test, you can
boot your current kernel with mem=768m to see if the problem
disappears.

Refer to the Myrinet FAQ for further details.

   http://www.myri.com/scs/faq/faq-install.html#install6b

2. Overlapping of prefetch memory for the AGP and PCI bridges.

This was observed on an SGI Visual Workstation 550 machine with the
following graphics cards: nVidia Quadro, ATI Mach64 PCI, ATI Rage AGP.
With these cards, the prefetchable memory assigned by the BIOS for the
AGP and PCI bridges overlaps. This looks like a BIOS problem, and we
asked the customer to look into upgrading the BIOS, or to experiment
with the BIOS settings to get the BIOS to do the right thing (things to
try: toggling the plug-n-play OS setting, changing the size of the AGP
graphics aperture, reinitializing or re-detecting the PCI space in the
configuration space, etc.).

Specifically, it was seen that the memory for the Myrinet card is
mapped at exactly the same spot with the ATI Mach64 PCI graphics card
as it is with the ATI Rage AGP graphics card:

   03:01.0 Non-VGA unclassified device: MYRICOM Inc.: Unknown device 8043 (rev 03)
           Region 0: Memory at 82000000 (64-bit, prefetchable) [size=16M]

However, now look at the bridges leading to bus 3 (PCI, where the
Myrinet card is) and bus 1 (AGP) in the ATI Rage AGP configuration:

   00:01.0 PCI bridge: Intel Corporation 82840 840 (Carmel) Chipset AGP Bridge (rev 01) (prog-if 00 [Normal decode])
           Bus: primary=00, secondary=01, subordinate=01, sec-latency=64
           Prefetchable memory behind bridge: 82300000-850fffff
   00:02.0 PCI bridge: Intel Corporation 82840 840 (Carmel) Chipset PCI Bridge (Hub B) (rev 01) (prog-if 00 [Normal decode])
           Bus: primary=00, secondary=02, subordinate=03, sec-latency=0
           Prefetchable memory behind bridge: 81600000-831fffff

These two prefetchable memory regions overlap. More importantly, the
prefetchable memory region of the bridge to the AGP bus overlaps that
of the Myrinet card: the card's 16M region spans 82000000-82ffffff, and
the AGP bridge claims everything from 82300000 upward. Note that the
only prefetchable memory on the AGP bus is for the Rage card, and that
this memory is a small subset of the region the bridge is claiming:

   01:00.0 VGA compatible controller: ATI Technologies Inc 3D Rage IIC AGP (rev 7a) (prog-if 00 [VGA])
           Region 0: Memory at 84000000 (32-bit, prefetchable) [size=16M]

This issue is now resolved. You need to download BIOS version A9 from
the SGI website.

===================
VIII. Miscellaneous
===================

------------------------------------
a. Uninstallation of the GM driver
------------------------------------

The gm_install_drivers script generates the script
/sbin/gm_uninstall_drivers, which can be used to uninstall the drivers.

The GM_INSTALL script generates the script /sbin/GM_UNINSTALL, which
can be used to uninstall GM.