Monday, August 28, 2006

Different ways to start Linux when it doesn't boot

Linux is a robust operating system, with strong security, a stable kernel (the operating system core) and a lot of flexibility, but that flexibility can become your worst nightmare if you type the wrong commands: it's easy to destroy a Linux system if you ask it to.

One of the most typical situations is a Linux system that doesn't boot. It is usually a simple problem, but you can feel helpless, because you cannot type anything into a system that doesn't start!

The most typical non-boot situations I have found are:

  1. The owner of the installation forgot the root password

  2. The new and incredible Linux kernel you recently installed gives you a kernel panic

  3. LILO hangs in the boot process and the kernel doesn't start

  4. You have installed a new Linux and it doesn't start up


Many different faults can lead to any of the situations above, but in all of them the first obstacle is the same: we cannot reach a shell, so we cannot type the commands needed to fix things. On this page I'll try to solve the most common cases; if these tips don't solve your problem, feel free to write a comment so we can help you and complete the page.

So, the first thing we have to do is get access to our crashed Linux. There are a few ways, depending on the nature of the problem:

1. Reset the root password


The easiest problem to fix is a forgotten root password. To solve it, start the boot process and press the SHIFT key when LILO shows on the screen. A prompt will appear. Pressing the TAB key lists the available kernels. We'll choose the first one, usually called linux, and type the following command at the LILO prompt:

LILO: linux init=/bin/sh

This tells Linux to run a shell as the very first step of the boot process, so we get into the operating system without entering any password. One more command is needed before changing anything: during boot the root partition is mounted read-only, so we have to remount it read-write to reset the password:

# mount -o rw,remount /

Now we can execute the passwd command without parameters to reset the root password.
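The whole sequence can be sketched as a small shell function. This is only a sketch under assumptions: the names run, reset_root_password and DRY_RUN are mine, and the final sync and read-only remount are an extra precaution I add because with init=/bin/sh there is no init process to shut the system down cleanly. With DRY_RUN=1 the commands are printed instead of executed, so you can review them first.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps to run from the init=/bin/sh shell; stops at the first failure.
reset_root_password() {
    run mount -o rw,remount / || return 1   # make the root filesystem writable
    run passwd || return 1                  # asks twice for the new root password
    run sync || return 1                    # flush the change to disk
    run mount -o ro,remount /               # back to read-only before resetting
}
```

Typing DRY_RUN=1 reset_root_password only lists the four commands; reset_root_password alone runs them. After that the machine can be reset and booted normally.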

2. The kernel gives you a kernel panic


To solve this problem we need another kernel to boot the operating system. Under Debian, when you install the new kernel with the dpkg tool, the system saves the previous kernel as an old kernel.
When you see the word LILO during startup, press the SHIFT key to enter a boot command. The TAB key gives us the list of kernels, and typing the name of another kernel in place of linux boots the previous kernel. Once the system is up we can uninstall the new kernel.

3. LILO hangs in the boot process and the kernel doesn't start


This problem can be solved with a live CD or the install CDs of our distribution. Under Debian you can use the first or fifth CD (or the DVD of the distribution) to start a Debian install. At the first screen of the install process, press ALT+F2 to switch to a shell. The following commands mount the root filesystem and run lilo to reinstall the boot loader:

# mount /dev/hda1 /mnt
# chroot /mnt
# lilo
# exit

Of course, you have to replace hda1 with your own root partition. The root partition is usually hda1 on a Linux-only disk, or hda2 on a disk that dual-boots Windows and Linux.

4. You have installed a new Linux and it doesn't start up


The last step when you install an operating system is to make the disk bootable. If something goes wrong during this step, or you cancel it, the operating system is correctly installed but the computer doesn't know that the installation is there.

To tell the computer that everything is all right and that it can start our new operating system, we can start the installation process again and follow the same steps as in problem 3 (LILO hangs) on this page.

Friday, August 25, 2006

Are your servers cool enough?

During a SAP R/3 implementation a few years ago in our company, with a startup of more than 10 modules and an Add-On (IS-U/CCS), can you guess what our worst problem during the startup was? The temperature!

A month prior to the startup, at the beginning of June, we received the hardware for the production system. Development was going very well and on schedule, and the startup day, August 1, seemed a comfortable date to start, just when people begin their holidays in Spain.

The problem arose when we powered on the new production cluster systems and the SAN array. The temperature of the datacenter kept rising until July 20, when the production system shut itself down because of overtemperature.

That mistake, which we solved by buying a portable air conditioning unit until the datacenter was upgraded with more cooling power, was a serious problem for a project with more than 60 people involved and six months of work.

That's something that will never happen to me again. The new datacenter has enough cooling power to handle up to twice our current number of servers, and it has multiple independent systems, but that is not enough if you don't know what temperature the datacenter is running at, because the cooling systems can fail or can be powered off by the cleaning lady ;-)

Now I have several hardware thermometers with TCP/IP, SNMP and a built-in web server, which can be found at W&T. Each unit has multiple temperature sensors, so you can watch several critical points at once. Connect the system to your SNMP monitoring software and you have an online temperature monitor that can alert you when the temperature rises.
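As an illustration, a check like this could be run from cron with the Net-SNMP snmpget tool. Treat it as a rough sketch: the host name, the community, the limit and above all the OID are placeholders of mine, not real W&T values (the real OID has to come from the MIB of your device), and the function names are hypothetical. It also assumes the sensor answers with a plain number in degrees Celsius.

```shell
# Placeholder settings: replace them with the values of your own device.
SENSOR_HOST="thermo.example.com"
SENSOR_OID=".1.3.6.1.4.1.99999.1.1"   # placeholder, NOT a real W&T OID
COMMUNITY="public"
LIMIT_C=28

# Read one sensor with Net-SNMP's snmpget; -Oqv prints only the value.
read_temp() {
    snmpget -v1 -c "$COMMUNITY" -Oqv "$SENSOR_HOST" "$SENSOR_OID"
}

# Succeeds when temperature $1 is above limit $2 (awk copes with decimals).
too_hot() {
    awk -v t="$1" -v max="$2" 'BEGIN { exit !(t + 0 > max + 0) }'
}

# Print one status line for reading $1 against limit $2.
check_temp() {
    if too_hot "$1" "$2"; then
        echo "ALERT: datacenter at $1 C (limit $2 C)"
    else
        echo "OK: datacenter at $1 C"
    fi
}
```

A cron job could call check_temp "$(read_temp)" "$LIMIT_C" every few minutes and mail the output whenever it starts with ALERT.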

About Secunia

As a system administrator, I have found a very useful companion in Secunia, a company that collects information about security vulnerabilities from major hardware and software vendors, classifies the advisories, and sends you, tagged, only the ones you are interested in.

It's a way to increase the security of a company, or at least to know how well you are dealing with security, and it helps you manage the balance between security and time.

I have been a customer for half a year and I'm very happy with the service. I recommend it to any sysadmin who doesn't have enough time to follow Bugtraq lists and vendor support web pages, or to any CIO concerned about the security of his IT infrastructure.

Oracle ORA-00221 Error

Recently I found an ORA-00221 at one of my customers. The problem arose when a SCSI controller failed, resulting in an I/O error while writing one of the control files.

The problem is serious, because the database stops and is not able to start again, but don't panic: the solution is easy.

Controlfiles are read during instance startup. You can see the error, as ORA 221 or ORA-00221, in the alert log, and the database is left unmounted.

To solve the problem, stop the instance, then look in the alert log, locate the entry about the 221 error, and find out which controlfile is corrupted. Then rename the corrupted file (I have learned in all my years as a system administrator that you should never delete a file: rename it, and delete it a few weeks later when you are sure you won't need it any more), and finally copy one of the other controlfiles to the name of the first one.
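The rename and copy step can be sketched as a small shell function. The function name and the file names in the usage line are hypothetical; use the paths reported in your own alert log, and run it only while the instance is stopped.

```shell
# Usage: restore_controlfile <corrupted controlfile> <intact controlfile>
restore_controlfile() {
    bad=$1
    good=$2
    if [ ! -f "$bad" ] || [ ! -f "$good" ]; then
        echo "usage: restore_controlfile <corrupted> <intact>" >&2
        return 1
    fi
    # Never delete: keep the damaged copy under a dated name.
    mv "$bad" "$bad.corrupt.$(date +%Y%m%d)" || return 1
    # -p keeps mode and timestamps (and the owner, when allowed to).
    cp -p "$good" "$bad"
}
```

For example, restore_controlfile /u02/oradata/PRD/control02.ctl /u01/oradata/PRD/control01.ctl leaves the damaged file next to the new copy as control02.ctl.corrupt.<date>, ready to be deleted a few weeks later.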

Now, if you restart the database, all should be fine.