Funny issue with "grep"

The other day I was working on a simple program on CentOS and ran into an issue that turned out to be quite funny. At first I didn't think it was worth a post, but I decided to write it down so that I never make the same mistake again.

I was trying to do an easy task: create a simple shell script to kill a running process. Sounds easy, right? I thought so too, until something strange happened.

Here is the command I used to kill the process named X on my system:

ps aux | grep X | awk '{ system("kill -9 " $2) }'

The above command first lists all processes running on the system (using ps aux). Because there were too many of them, I kept only the lines containing "X" (using grep X). Finally, for each matching line, I ran kill -9 on its process ID, which is the second column of the ps output. If you are not familiar with awk, you can read more here: http://www.grymoire.com/Unix/Awk.html. Basically, it lets you do something with each line of the input it receives through the | (pipe).

I tested this command several times and it worked as expected, even though it had one minor issue: it would also try to kill the process created by the grep command itself, and that would fail. To see this, try running this command on your system:
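To see what awk does with each line without killing anything, here is a small dry run. It is only a sketch: the sample line is made up (the user alice and PID 1234 are hypothetical), and it prints the kill command instead of running it through system().

```shell
# A made-up line in the same layout as `ps aux` output (user, PID, ...).
sample='alice 1234 0.0 0.1 10000 500 pts/0 S 10:00 0:00 X'

# awk splits each line on whitespace; $2 is the second field (the PID).
# Printing the command instead of calling system() keeps this harmless.
printf '%s\n' "$sample" | awk '{ print "kill -9 " $2 }'
# prints: kill -9 1234
```

Swapping print back to system(...) turns the dry run into the real thing.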

ps aux | grep abcxyz

Even though there is no process named "abcxyz", we still see output like this:

chientran       39359   0.0  0.0  2445076    796 s006  S+    9:01PM   0:00.01 grep abcxyz

Here, the first column is the user name and the second column is the process ID. Even though no process is related to the name abcxyz, one line still shows up: the grep process itself, whose command line contains the pattern. By the time the line is shown to the user, that process has already exited, so if you try to kill it you get the error "No such process". That error doesn't affect the other kill commands, so overall we are still able to kill X.

But that's not the main point of this post, so let's continue with the story. I put the above command in a shell script named kill_X.sh and granted it execute permission:
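One common workaround for this minor issue (not what I used at the time) is to wrap the first letter of the pattern in brackets. The regular expression [a]bcxyz still matches the text "abcxyz", but grep's own command line now contains "[a]bcxyz", which the pattern does not match. A minimal sketch on made-up ps lines (the user and PIDs are hypothetical):

```shell
# Two made-up `ps aux`-style lines: a real process and the grep itself.
lines='alice 1234 0.0 0.1 10000 500 pts/0 S 10:00 0:00 abcxyz --serve
alice 5678 0.0 0.0 9000 400 pts/0 S+ 10:01 0:00 grep [a]bcxyz'

# Only the first line contains the literal text "abcxyz";
# the grep line contains "[a]bcxyz" and is filtered out.
printf '%s\n' "$lines" | grep '[a]bcxyz' | awk '{ print $2 }'
# prints: 1234
```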

chmod +x kill_X.sh

So far so good. Now you would expect this script to work and save you from typing a long command to kill X, right? Not quite. I ran the script like this:

./kill_X.sh

And I saw the same error again: "No such process", while process X was still running happily. What was going on?

First, you already know why the error message appears: it comes from the grep process. But as I pointed out, that should not stop us from killing X, so why did it not work?

The reason was the way I had named my shell script. When I ran ./kill_X.sh, a process was created to run the script. So at that moment there were in fact three processes with X in their command lines:

  1. the process of the grep X command inside the kill_X.sh file
  2. the process handling the command ./kill_X.sh
  3. the main process X

My awk command handled one process at a time, and the main process X was the last one to be handled. So awk '{system("kill -9 " $2)}' had already killed the process handling the command ./kill_X.sh before it reached the main process. That was exactly what happened to me: my script kill_X.sh killed itself before it could kill X.
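One way to stop a script from killing itself, sketched here under my own assumptions, is to skip the script's own PID when picking targets. The helper name pids_except and the sample lines are hypothetical; in a real script the input would come from ps aux and the PID to skip would be "$$".

```shell
# Hypothetical helper: read `ps aux`-style lines on stdin and print the PIDs
# of lines matching the pattern in $1, skipping the PID given in $2.
pids_except() {
    grep -- "$1" | awk -v self="$2" '$2 != self { print $2 }'
}

# Made-up input standing in for `ps aux`: the main process X,
# the shell running ./kill_X.sh (PID 2000), and the grep itself.
sample_ps='alice 1234 0.0 0.1 10000 500 pts/0 S 10:00 0:00 X
alice 2000 0.0 0.0 9000 400 pts/0 S+ 10:05 0:00 /bin/sh ./kill_X.sh
alice 2001 0.0 0.0 9000 400 pts/0 S+ 10:05 0:00 grep X'

printf '%s\n' "$sample_ps" | pids_except X 2000
# prints 1234 and 2001, one per line; PID 2000 (the script) is skipped
```

Inside the script itself, that would look like ps aux | pids_except X "$$", with the resulting PIDs then passed on to kill.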

Knowing the issue, I renamed my file to something without X in it, such as kill_my_program.sh, and it worked properly again, although I should have found a better fix than renaming the script to something less descriptive. Anyway, I was happy to be able to resolve this issue thanks to the help of my professor, who reviewed it with me. You can see the power of pair programming, right?

I am not sure if anyone else has had this kind of issue before, but I hope you have learned something new from this post and will not make the same mistake I did.
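As for the better fix, one option I could have tried is pkill, which on CentOS comes with the procps tools. This is only a sketch, assuming pkill is installed and that the process name is exactly X; the wrapper name kill_by_name is made up. With -x, pkill matches the process name exactly instead of searching the whole command line, so a script called kill_X.sh is not a match, and pkill never reports itself as a match either.

```shell
# Hypothetical wrapper: send SIGKILL to every process whose name is exactly $1.
# -x requires an exact match on the process name, so "kill_X.sh" does not match "X".
kill_by_name() {
    pkill -9 -x "$1"
}
```

The script body then becomes a single line, kill_by_name X, and the file name no longer matters.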

Happy coding and good luck!