Signal 11 while compiling the kernel

This FAQ describes the possible causes of a problem that has been bothering lots of people lately: a Linux(*) kernel compile (or the compile of any other large package, for that matter) crashes with a "signal 11". The cause can be software or (most likely) hardware. Read on to find out more.
(*) Of course nothing is Linux specific. If your hardware is flaky, Linux, Windows 3.1, FreeBSD, Windows NT and NextStep will all crash.
If you are not reading this at http://www.BitWizard.nl/sig11/, that's where you can find the most recent version.
For those of you who prefer reading this in French, the French translation can be found at http://www.linux-france.com/article/sig11fr/.
Email me at R.E.Wolff@BitWizard.nl if you find any spelling errors, have a worthwhile addition, or have an "it also happened to me" story. (Note that I reject some suggested additions because I believe they are technical nonsense.) I would appreciate it if you put "sig11" or something like that in the subject.

The Sig11 FAQ


QUESTION

My kernel compile crashes with
      gcc: Internal compiler error: program cc1 got fatal signal 11
What is wrong with the compiler? Which version of the compiler do I need? Is there something wrong with the kernel?

ANSWER

Most likely there is nothing wrong with your installation, your compiler or your kernel. It very likely has something to do with your hardware. A variety of subsystems can be at fault, and there are a variety of ways to fix them. Read on, and you'll find out more. There are two exceptions to this "rule": you could be running low on virtual memory, or you could be installing Red Hat 5. There is more about this near the end.

QUESTION

OK, it may not be the software. How do I know for sure?

ANSWER

First let's make sure it is the hardware that is causing your trouble. When the "make" stops, simply type "make" again. If it compiles a few more files before stopping, it must be the hardware that is causing your trouble. If it immediately stops again (i.e. scans a few directories with "nothing to be done for xxxx" before bombing at exactly the same place), try
        dd if=/dev/HARD_DISK of=/dev/null bs=1024k count=MEGS
Change HARD_DISK to the name of your hard disk (e.g. hda or sda; "df ." will tell you which disk you are on). Change MEGS to the number of megabytes of main memory that you have. This reads the first MEGS megabytes of your hard disk, which pushes the C source files and the gcc binary out of the in-memory disk cache and forces them to be reread from disk the next time you run make. Now type make again. If it still stops in the same place, I'm starting to wonder if you're reading the right FAQ, as it is starting to look like a software problem after all.... Take a peek at the "What are other possibilities?" question..... If the compiler keeps stopping at the same place without this "dd" command, but moves to another place after you use the "dd", you definitely have a disk->RAM transfer problem.
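
The procedure above can be sketched as a small Python helper. This is only an illustration, not part of the original recipe: the function names are made up, and it assumes that the build echoes each compile command just before gcc reports the crash, so that the line preceding the "got fatal signal" message identifies where the build stopped.

```python
import re
import subprocess

# Matches the message gcc prints when cc1 dies, as quoted in this FAQ.
SIGNAL_RE = re.compile(r"got fatal signal (\d+)")


def run_build(cmd=("make",)):
    """Run the build once and return its combined stdout/stderr as text."""
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return proc.stdout


def fatal_signal(output):
    """Return the signal number from a 'got fatal signal N' line, or None."""
    match = SIGNAL_RE.search(output)
    return int(match.group(1)) if match else None


def stop_point(output):
    """Return the last non-empty line printed before the crash message.

    Assumption: that line is the echoed compile command, which names the
    source file being compiled when the build stopped.
    """
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        if SIGNAL_RE.search(line):
            return lines[i - 1] if i > 0 else line
    return None


def verdict(stop_points):
    """Classify the stop points collected from repeated builds."""
    seen = [p for p in stop_points if p is not None]
    if len(seen) < 2:
        return "not enough crashes to tell"
    if len(set(seen)) > 1:
        return "crash moves around: suspect hardware"
    return "crash repeats in the same place: flush the cache with dd and retry"
```

To use it, call run_build() a few times, pass each output through stop_point(), and hand the resulting list to verdict(). If the crash repeats in the same place, run the dd command shown above and repeat the experiment.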

QUESTION

What does it really mean?

ANSWER

Well, the compiler accessed memory outside its memory range. If this happens on working hardware, it's a programming error inside the compiler. That's why it says "internal compiler error". However, when the hardware occasionally flips a bit, gcc uses so many pointers that it is likely to end up accessing something outside of its addressing range. (Random addresses are mostly outside your addressing range, as not very many people have a significant part of 4G as main memory... :-)
It seems that nowadays everybody with "signal 11" problems gets directed to this page. If you're developing your own software, or have software that hasn't been debugged quite enough, "signal 11" (or segmentation fault) is still a strong hint that there is something wrong with the program. Only when you can cause a "known working" program like "gcc" to crash on a dataset (e.g. the Linux kernel) that has also been well tested does it become a hint that there is something wrong with your hardware.
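
To put a rough number on the remark about random addresses, here is a minimal sketch. It assumes a 32-bit address space and, purely as an example, 64 megabytes of valid memory; neither figure comes from a measurement, so substitute your own.

```python
def chance_random_address_is_valid(valid_bytes, address_bits=32):
    """Fraction of all possible addresses that fall inside valid memory."""
    return valid_bytes / 2 ** address_bits


# Hypothetical machine: 64 MB of valid memory in a 4 GB (32-bit) address space.
valid = 64 * 2 ** 20
p = chance_random_address_is_valid(valid)
print(f"valid: {p:.2%}, outside: {1 - p:.2%}")
```

With these assumed numbers, fewer than 2 in 100 random addresses are valid, so a pointer that has turned into garbage will almost always cause a segmentation fault when it is used.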

QUESTION

OK. I may have a hardware problem. What is it?

ANSWER

If it happens to be the hardware it can be:
  • The memory hole. Many modern motherboards allow you to use old ISA video cards with one or two megabytes of linear frame buffer. To achieve this, they have to map out the memory just below 16 MB. Nobody actually ever used this feature, but if you turn the memory hole (or LFB support in some BIOSes) on, your machine will certainly be flaky..... -- Paul Connolly (pconnolly@macdux.com.au)

    QUESTION

    RAM timing problems? I fiddled with the BIOS settings more than a month ago. I've compiled numerous kernels in the meantime and nothing went wrong. It can't be the RAM timing. Right?

    ANSWER

    Wrong. Do you think that the RAM manufacturers have a machine that makes 60ns RAMs and another one that makes 70ns RAMs? Of course not! They make a bunch, and then test them. Some meet the specs for 60 ns, others don't. Those might be 61 ns if the manufacturer had to put a number on them. In that case it is quite likely that such a chip works in your computer as long as, for example, the temperature is below 40 degrees centigrade (chips become slower when the temperature rises; that's why some supercomputers need so much cooling).

    However "the coming of summer" or a long compile job may push the temperature inside your computer over the "limit". -- Philippe Troin (ptroin@compass-da.com)


    QUESTION

    I got suckered into not buying ECC memory because it was slightly cheaper. I feel like a fool. I should have bought the more expensive ECC memory. Right?

    ANSWER

    Buying the more expensive ECC memory and motherboards protects you against one type of error: bit flips caused at random by passing alpha particles.
    Most people can reproduce "signal 11" problems within half an hour using "gcc", but cannot reproduce them by memory testing for hours in a row. That proves to me that it is not simply a random alpha particle flipping a bit, because that would get noticed by the memory test too. This means that something else is going on. I have the impression that most sig11 problems are caused by timing errors on the CPU <-> cache <-> memory path. ECC on your main memory doesn't help you in that case. When should you buy ECC? a) When you feel you need it. b) When you have LOTS of RAM. (Why not a cut-off number? Because the cut-off changes with time, just like "LOTS".) Some people feel very strongly about everybody using ECC memory. I refer them to reason "a)".

    QUESTION

    Memory problems? My BIOS tests my memory and tells me it's OK. I have this fancy DOS program that tells me my memory is OK. Can't be memory, right?

    ANSWER

    Wrong. The memory test in the BIOS is utterly useless. It may even occasionally OK more memory than really is available, let alone test whether it is good or not.
    A friend of mine used to have a 640k PC (yeah, this was a long time ago) which had a single 64kbit chip instead of a 256kbit chip in the second 256k bank. This means that he effectively had 320k working memory. Sometimes the BIOS would test 384k as "OK". Anyway, only certain applications would fail. It was very hard to diagnose the actual problem....
    Most memory problems only occur under special circumstances. Those circumstances are hardly ever known. gcc seems to exercise them. Some memory tests, especially BIOS memory tests, don't. I'm no longer working on creating a floppy with a Linux kernel and a good memory tester on it. Forget about bugging me about it......
    The reason memory tests miss these problems is that a memory test causes the CPU to execute just a few instructions, and its memory access patterns tend to be very regular. Under these circumstances only a very small subset of the memories breaks down. If you're studying Electrical Engineering and are interested in memory testing, a master's thesis could be to figure out what's going on. There are computer manufacturers that would want to sponsor such a project with some hardware that clients claim to be unreliable, but that doesn't fail the production tests......
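
The difference between a regular and an irregular access pattern can be illustrated with a small sketch. It is a toy, not a real memory tester: Python code runs far away from the hardware, a buffer this small mostly lives in the CPU cache, and the function names and the fill pattern are invented for this example. It only shows what "visiting the same cells in a less regular order" means.

```python
import random


def access_orders(size, seed=0):
    """Return two visiting orders for `size` cells: sequential and shuffled."""
    sequential = list(range(size))
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)
    return sequential, shuffled


def expected_byte(index):
    """Arbitrary address-derived fill pattern (hypothetical choice)."""
    return (index * 31 + 7) & 0xFF


def pattern_test(size, order):
    """Write the pattern to every cell in `order`, then read it back.

    Returns the indices whose contents did not read back correctly.
    """
    buf = bytearray(size)
    for i in order:
        buf[i] = expected_byte(i)
    return [i for i in order if buf[i] != expected_byte(i)]


sequential, shuffled = access_orders(4096)
print("sequential failures:", pattern_test(4096, sequential))
print("shuffled failures:  ", pattern_test(4096, shuffled))
```

A typical simple memory test resembles the sequential walk. A compiler run looks more like the shuffled order, with many more instructions executed between accesses.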

    QUESTION

    Does it only happen when I compile a kernel?

    ANSWER

    Nope. There is no way your hardware can know that you are compiling a kernel. It just so happens that a kernel compile is very tough on your hardware, so it just happens a lot when you are compiling a kernel. Compiling other large packages like gcc or glibc also often triggers the sig11.

    QUESTION

    Nothing crashes on NT, Windows 95, OS/2 or DOS. It must be something Linux specific.

    ANSWER

    First of all, Linux stresses your hardware more than all of the above. Some OSes, like the Microsoft ones named above, crash in unpredictable ways anyway. Nobody is going to call Microsoft and say "hey, my Windows box crashed today". If you do anyway, they will tell you that you, the user, made an error (see the interview with Bill Gates in a German magazine....) and that since it works now, you should shut up.
    Those OSes are also somewhat more "predictable" than Linux. This means that Excel might always be loaded in the exact same memory area. Therefore, when the bit error occurs, it is always Excel that gets it. Excel will crash. Or Excel will crash another application. Anyway, it will seem to be a single application that fails, and not related to memory.
    What I am sure of is that a cleanly installed Linux system should be able to compile the kernel without any errors. Certainly no sig-11 ones. (** Exception: Red Hat 5.0 with a Cyrix processor. See elsewhere. **)
    Really, Linux and gcc stress your hardware more than other OSes. If you need a non-Linux thingy that stresses your hardware to the point of crashing, you can try Winstone. -- Jonathan Bright (bright@informix.com)

    QUESTION

    Is it always signal 11?

    ANSWER

    Nope. Other signals like four, six and seven also occur occasionally. Signal 11 is most common though.
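
The signal numbers can be translated into names with Python's standard signal module. This is a small sketch that assumes Linux on x86, where 4, 6, 7 and 11 are SIGILL (illegal instruction), SIGABRT (abort), SIGBUS (bus error) and SIGSEGV (segmentation fault). The numbering differs on some other platforms, and the helper name is made up for this example.

```python
import signal


def signal_name(number):
    """Return the symbolic name of a signal number on this platform."""
    return signal.Signals(number).name


for number in (4, 6, 7, 11):
    print(number, signal_name(number))
```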

    As long as memory is getting corrupted, anything can happen. I'd expect bad binaries to occur much more often than they really do. Anyway, it seems that the odds are heavily biased towards gcc getting a signal 11. Also seen:

    The first few ones are cases where the kernel "suspects" a kernel-programming-error that is actually caused by the bad memory. The last few point to application programs that end up with the trouble.

    -- S.G.de Marinis (trance@interseg.it)
    -- Dirk Nachtmann (nachtman@kogs.informatik.uni-hamburg.de)


    QUESTION

    What do I do?

    ANSWER

    Here are some things to try when you want to find out what is wrong... Note: some of these will significantly slow your computer down. These things are intended to get your computer to function properly and allow you to narrow down what's wrong with it. With this information you can, for example, try to get the faulty component replaced by your vendor. The hardest part is that most people will be able to do all of the above except borrowing memory from someone else, and it doesn't make a difference. This makes it likely that it really is the RAM. Currently RAM is the priciest part of a PC, so you would rather not reach this conclusion, but I'm sorry, I get lots of reactions that in the end turn out to be the RAM. However, don't despair just yet: your RAM may not be completely wasted; you can always try to trade it in for different or more RAM.

    QUESTION

    I had my RAMs tested in a RAM-tester device, and they are OK. Can't be the RAM, right?

    ANSWER

    Wrong. It seems that the errors that are currently occurring in RAMs are not detectable by RAM-testers. It might be that your motherboard is accessing the RAMs in dubious ways or otherwise messing up the RAM while it is in YOUR computer. The advantage is that you can sell your RAM to someone who still has confidence in his RAM-tester......

    QUESTION

    What are other possibilities?

    ANSWER

    Others have noted the following possibilities:

    QUESTION

    I don't believe this. To whom has this happened?

    ANSWER

    Well for one it happened to me personally. But you don't have to believe me. It also happened to:
    I'm interested in new stories. If you have a problem and are unsure about what it is, it may help to Email me at R.E.Wolff@BitWizard.nl . My curiosity will usually drive me to answering your questions until you find what the problem is..... (on the other hand, I do get pissed when your problem is clearly described above :-)
    This page is hosted by www.BitWizard.nl