Wednesday, June 24, 2015

C Programming: Pitfalls while word counting

In the earlier post "Count number of characters, words and lines in input", we saw how to count the number of characters, words and lines. Today, we will look at the pitfalls of word counting.

K&R has a separate exercise for this. Exercise 1-11 asks: "How would you test the word count program? What kinds of input are most likely to uncover bugs if there are any?"

When we wrote this program earlier, we did not handle many cases. There are caveats that need to be identified and addressed.

First, let's list the checks we usually need while word counting.

1. Check for very short words.
2. Check for very long words.
3. Check for words that are separated by a new line. For example "kernel trap", where "kernel" is at the end of one line and "trap" follows on the next line.
4. Consider whether words like "isn't" and "tour's" count as single words.
5. Check the overall file size around the 2GB mark, where a 32-bit int counter can overflow.
6. Check for mistyped words like "kernel  -  trap", which contain spaces in the middle, or words joined by a - instead of a space, e.g. "kernel-trap".
7. Check for non-ASCII characters.
8. Check for different encodings.

Please share your thoughts if I have missed any checks.

Monday, June 1, 2015

C Program : Count number of characters, words and lines in input

Earlier, we saw "C : Program to display tabs , backspaces visible in an unambiguous way". Today we will write a small C program which counts the number of characters, words and lines.

The program looks easy when it only has to count lines and characters, but what about words?

If we read stdin character by character, we can count characters easily. Also, using '\n', we can detect that a line has ended. The only problem is counting words.

So, we will use a mechanism to detect when a new word starts. Let's introduce two states, IN and OUT: IN means we are currently inside a word, and OUT means we are outside one.

 #include <stdio.h>  
   
 #define IN 1  
 #define OUT 0  
   
 int main()  
 {  
   int c, nl, nw, nc, state;  
   
   state = OUT;  
   nl = nw = nc = 0;  
   
   while((c = getchar()) != EOF)  
   {  
     /* Increment number of characters */  
     ++nc;  
   
     /* Increment number of lines if end of line is encountered */  
     if( c == '\n' )  
     {  
       ++nl;  
     }  
     /* Space, newline or tab: we are outside a word */  
     if( c == ' ' || c == '\n' || c == '\t' )  
     {  
       state = OUT;   /* Just finished processing a word, if any */
     } /* First character of a new word: increment word count */  
     else if ( state == OUT )  
     {  
       state = IN;  
       ++nw;  
     }  
   }  
   printf(" Number of Characters = %d \n",nc);  
   printf(" Number of lines = %d \n",nl);  
   printf(" Number of Words = %d \n",nw);  
   return 0;  
 }  
   

Output of this program

 mrtechpathi@mrtechpathi:~/Study/C/K_and_R$ ./a.out   
 This program counts  
 number of characters  
 number of words  
 number of lines  
  Number of Characters = 73   
  Number of lines = 4   
  Number of Words = 12   

In this program

  • We read input character by character 
  • First we increment the character count (nc)
  • We then check if it is a new line. If it is, we increment the line count (nl)
  • Now to increment the word count (a word is a collection of characters), we need multiple checks which signify the end of a word.
  • If it's a new line, tab or space, we are outside a word (state = OUT). The first other character after that starts a new word, so we increment the word count (nw)
  • Finally when you press Ctrl+D (in Linux), the program will display the number of characters, number of lines and number of words.
Hope this helped :)