Showing posts with label Pattern Matching. Show all posts
Showing posts with label Pattern Matching. Show all posts

Monday, May 2, 2011

Chapter 11: Formatting Your Output


Chapter 11

Formatting Your Output



CONTENTS

The Perl programs you've seen so far produce output using the print function, which writes raw, unformatted text to a file.
Perl also enables you to produce formatted output, using print formats and the built-in function write. Today's lesson describes how to produce formatted output. You'll learn the following:
  • How to define a print format (also sometimes known as a "picture format")
  • How to use the write function
  • How to add formatted values to a print format
  • Which value-field formats are available
  • How to write to other output files
  • How to specify page headers and the page length
  • How to format long character strings
  • How to use the built-in function printf

Defining a Print Format

The following is an example of a simple print format:
format MYFORMAT =

===================================

Here is the text I want to display.

===================================

.

This defines the print format MYFORMAT.
The syntax for print formats is
format formatname =

lines_of_output

.

The special keyword format tells the Perl interpreter that the following lines are a print-format definition. The formatname is a placeholder for the name of the print format being defined (for example, MYFORMAT). This name must start with an alphabetic character and can consist of any sequence of letters, digits, or underscores.
The lines_of_output consists of one or more lines of text that are to be printed when the print format is utilized; these lines are sometimes called picture lines. In the MYFORMAT example, there are three lines of text printed: two lines containing = characters, and the line
Here is the text I want to display.

A print-format definition is terminated with a line containing a period character. This line can contain nothing else; there can be no white space, and the period must be the first character on the line.
Like subroutines, print-format definitions can appear anywhere in program code (even, for example, in the middle of a conditional statement). However, it usually is best to cluster them either at the beginning or the end of the program.

Displaying a Print Format

To display output using a print format, you need to do two things:
  • Set the system variable $~ to the format you want to use
  • Call the built-in function write
Listing 11.1 is an example of a simple program that displays output using a print format.



Listing 11.1. A program that uses a print format.
1:  #!/usr/local/bin/perl

2:  

3:  $~ = "MYFORMAT";

4:  write;

5:  

6:  format MYFORMAT =

7:  ===================================

8:  Here is the text I want to display.

9:  ===================================

10: .



$ program11_1

===================================

Here is the text I want to display.

===================================

$



Line 3 of this program assigns the character string MYFORMAT to the system variable $~. This tells the Perl interpreter that MYFORMAT is the print format to use when calling write.
Line 4 calls write, which sends the text defined in MYFORMAT to the standard output file.
Lines 6-10 contain the definition of the print format MYFORMAT.


NOTE
If you don't specify a print format by assigning to $~, the Perl interpreter assumes that the print format to use has the same name as the file variable being written to. In this example program, if line 3 had not specified MYFORMAT as the print format to use, the Perl interpreter would have tried to use a print format named STDOUT when executing the call to write in line 4, because the call to write is writing to the standard output file

Displaying Values in a Print Format

Of course, the main reason to use print formats is to format values stored in scalar variables or array variables to produce readable output. Perl enables you to do this by specifying value fields as part of a format definition.
Each value field specifies a value: the name of a scalar variable, for example, or an expression. When the write statement is invoked, the value is displayed in the format specified by the value field.
Listing 11.2 shows how value fields work. This program keeps track of the number of occurrences of the letters a, e, i, o, and u in a text file.



Listing 11.2. A program that uses value fields to print output.
1:  #!/usr/local/bin/perl

2:  

3:  while ($line = <STDIN>) {

4:          $line =~ s/[^aeiou]//g;

5:          @vowels = split(//, $line);

6:          foreach $vowel (@vowels) {

7:                  $vowelcount{$vowel} += 1;

8:          }

9:  }

10: $~ = "VOWELFORMAT";

11: write;

12: 

13: format VOWELFORMAT =

14: ==========================================================

15: Number of vowels found in text file:

16:           a: @<<<<<   e: @<<<<<

17:           $vowelcount{"a"}, $vowelcount{"e"}

18:           i: @<<<<<   o: @<<<<<

19:           $vowelcount{"i"}, $vowelcount{"o"}

20:           u: @<<<<<

21:           $vowelcount{"u"}

22: ==========================================================

23: .



$ program11_2

This is a test file.

This test file contains some vowels.

The quick brown fox jumped over the lazy dog.

^D

==========================================================

Number of vowels found in text file:

          a: 3        e: 10

          i: 7        o: 7

          u: 2

==========================================================

$



This program reads one line of input at a time. Line 4 removes everything that is not a, e, i, o, or u from the input line, and line 5 splits the remaining characters into the array @vowels. Each element of @vowels is one character of the input line.
Lines 6-8 count the vowels in the input line by examining the elements of @vowels and adding to the associative array %vowelcount.
Line 10 sets the current print format to VOWELFORMAT; line 11 prints using VOWELFORMAT.
The print format VOWELFORMAT is defined in lines 13-23. Line 16 is an example of a print format line that contains value fields; in this case, two value fields are defined. Each value field has the format @<<<<<, which indicates six left-justified characters. (For a complete description of the possible value fields, see the section called "Choosing a Value-Field Format," later today.)
When one or more value fields appear in a print-format line, the next line must define the value or values to be printed in this value field. Because line 16 defines two value fields, line 17 defines the two values to be printed. These values are $vowelcount{"a"} and $vowelcount{"e"}, which are the number of occurrences of a and e, respectively.
Similarly, line 18 defines two more value fields to be printed, and line 19 indicates that the values to be printed in these fields are $vowelcount{"i"} and $vowelcount{"o"}. Finally, line 20 defines a fifth value field, and line 21 specifies that $vowelcount{"u"} is to be printed in this field.


NOTE
Three things to note about the values that are specified for value-field formats:
  • The lines containing values to be printed are not themselves printed. For example, in Listing 11.2, lines 16, 18, and 20 are printed, but lines 17, 19, and 21 are not.
  • The Perl interpreter ignores spacing when it looks for values corresponding to value fields. Many people prefer to line up their values with the corresponding value fields on the previous line, but there is no need to do so.
  • The number of values specified must match the number of value fields defined on the previous line

Creating a General-Purpose Print Format

One disadvantage of print formats as defined in Perl is that scalar-variable names are included as part of the definition. For example, in the following definition, the scalar variable $winnum is built into the print format definition MYFORMAT:
format MYFORMAT =

==========================================================

The winning number is @<<<<<<!

$winnum

==========================================================

.

When write is called with this print format, as in the following, you have to remember that $winnum is being used by MYFORMAT.
$~ = "MYFORMAT";

write;

If, later on, you accidentally delete all references to $winnum in the program, the call to write will stop working properly.
One way to get around this problem is to call write from within a subroutine, and to use variables local to the subroutine in the print format that write uses. Listing 11.3 is a program that does this. It reads a file from the standard input file and prints out the number of occurrences of the five most frequently occurring letters.



Listing 11.3. A program that calls write from within a subroutine.
1:  #!/usr/local/bin/perl

2:  

3:  while ($line = <STDIN>) {

4:          $line =~ tr/A-Z/a-z/;

5:          $line =~ s/[^a-z]//g;

6:          @letters = split(//, $line);

7:          foreach $letter (@letters) {

8:                  $lettercount{$letter} += 1;

9:          }

10: }

11: 

12: $~ = "WRITEHEADER";

13: write;

14: $count = 0;

15: foreach $letter (reverse sort occurrences

16:                 (keys(%lettercount))) {

17:         &write_letter($letter, $lettercount{$letter});

18:         last if (++$count == 5);

19: }

20: 

21: sub occurrences {

22:         $lettercount{$a} <=> $lettercount{$b};

23: }

24: sub write_letter {

25:         local($letter, $value) = @_;

26: 

27:         $~ = "WRITELETTER";

28:         write;

29: }

30: format WRITEHEADER =

31: The five most frequently occurring letters are:

32: .

33: format WRITELETTER =

34:         @:  @<<<<<<

35:         $letter, $value

36: .



$ program11_3

This is a test file.

This test file contains some input.

The quick brown fox jumped over the lazy dog.

^D

The five most frequently occurring letters are:

        t: 10

        e: 9

        i: 8

        s: 7

        o: 6

$



Like the vowel-counting program in Listing 11.2, this program processes one line of input at a time. Line 4 translates all uppercase alphabetic characters into lowercase, so that they can be included in the letter count. Line 5 gets rid of all characters that are not letters, including any white space.
Line 6 splits the line into its individual letters; lines 7-9 examine each letter and increment the appropriate letter counters, which are stored in the associative array %lettercount.
Lines 12 and 13 print the following line by setting the current print format to WRITEHEADER and calling write:
The five most frequently occurring letters are:



Except in very special cases, never mix calls to write with calls to print. Your program should use one printing function or the other, not both

Lines 15-19 sort the array %lettercount in order of occurrence. The first letter to appear in the foreach loop is the letter that appears most often in the file. To sort the array in order of occurrence, lines 15 and 16 specify that sorting is to be performed according to the rules defined in the subroutine occurrences. This subroutine tells the Perl interpreter to use the values of the associative array elements as the sort criterion.
Line 17 passes the letter and its occurrence count to the subroutine write_letter. This subroutine sets the current print format to WRITELETTER; this print format refers to the local scalar variables $letter and $value, which contain the values passed to write_letter by line 17. This means that each call to write_letter prints the letter and value currently being examined by the foreach loop.
Note that the first value field in the print format WRITELETTER contains only a single character, @. This indicates that the write field is only one character long (which makes sense, because this is a single letter).
Line 18 ensures that the foreach loop quits after the five most frequently used letters have been examined and printed.


TIP
Some programs, such as the one in Listing 11.3, use more than one print-format definition. To make it easier to see which print format is being used by a particular call to write, always keep the print format specification statement and the write call together. For example:
$~ = "WRITEFORMAT";
write;
Here, it is obvious that the call to write is using the print format WRITEFORMAT

Formats and Local Variables

In Listing 11.3, you might have noticed that the subroutine write_letter calls a subroutine to write out a letter and its value:
sub write_letter {

        local($letter, $value) = @_;



        $~ = "WRITELETTER";

        write;

}

This subroutine works properly even though the WRITELETTER print format is defined outside the subroutine.
Note, however, that local variables defined using my cannot be written out using a print format unless the format is defined inside the subroutine. (To see this for yourself, change line 25 of Listing 11.3 to the following and run the program again:
my($letter,$value) = @_;

You will notice that the letter counts do not appear.) This limitation is a result of the way local variables defined using my are stored by the Perl interpreter. To avoid this difficulty, use local instead of my when you define local variables that are to be written out using write. (For a discussion of local and my, see Day 9, "Using Subroutines.")
Perl 4 users will not run into this problem, because my is not defined for that version of the language.


NOTE
In versions of Perl 5 earlier than version 5.001, local variables defined using my cannot be written out at all. Even in version 5.001, variables defined using my might not behave in the way you expect them to. As a consequence, it is best to avoid using my with print formats

Choosing a Value-Field Format

Now that you know how print formats and write work, it's time to look at the value-field formats that are available. Table 11.1 lists these formats.


Table 11.1. Valid value-field formats.

FieldValue-field format
@<<<Left-justified output
@>>>Right-justified output
@|||Centered output
@##.##Fixed-precision numeric
@*Multiline text


NOTE
In left-justified output, the value being displayed appears at the left end of the value field. In right-justified output, the value being displayed appears at the right end of the value field

In each of the field formats, the first character is a line-fill character. It indicates whether text formatting is required. If the @ character is specified as the line fill character, text formatting is not performed. (For a discussion of text formatting, see the section titled "Formatting Long Character Strings," later today.)
In all cases, except for the multiline value field @*, the width of the field is equal to the number of characters specified. The @ character is included when counting the number of characters in the value field. For example, the following field is five characters wide-one @ character and four > characters:
@>>>>

Similarly, the following field is seven characters wide-four before the decimal point, two after the decimal point, and the decimal point itself:
@###.##

Listing 11.4 illustrates how you can use the value field formats to produce a neatly printed report. The report is redirected to a file for later printing.



Listing 11.4. A program that uses the various value-field formats.
1:  #!/usr/local/bin/perl

2:  

3:  $company = <STDIN>;

4:  $~ = "COMPANY";

5:  write;

6:  

7:  $grandtotal = 0;

8:  $custline = <STDIN>;

9:  while ($custline ne "") {

10:         $total = 0;

11:         ($customer, $date) = split(/#/, $custline);

12:         $~ = "CUSTOMER";

13:         write;

14:         while (1) {

15:                 $orderline = <STDIN>;

16:                 if ($orderline eq "" || $orderline =~ /#/) {

17:                         $custline = $orderline;

18:                         last;

19:                 }

20:                 ($item, $cost) = split(/:/, $orderline);

21:                 $~ = "ORDERLINE";

22:                 write;

23:                 $total += $cost;

24:         }

25:         &write_total ("Total:", $total);

26:         $grandtotal += $total;

27: }

28: &write_total ("Grand total:", $grandtotal);

29: 

30: sub write_total {

31:         local ($totalstring, $total) = @_;

32:         $~ = "TOTAL";

33:         write;

34: }

35: 

36: format COMPANY =

37: ************* @|||||||||||||||||||||||||||||| *************

38: $company

39: .

40: format CUSTOMER =

41: @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<                @>>>>>>>>>>>>

42: $customer, $date

43: .

44: format ORDERLINE =

45:           @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<           @####.##

46: $item, $cost

47: .

48: format TOTAL =

49: @<<<<<<<<<<<<<<                                   @#####.##

50: $totalstring, $total

51: 

52: .



$ program11_4 >report

Consolidated Widgets, Inc.

John Doe#Feb 11, 1994

1 flying widget:171.42

1 crawling widget:89.99

Mary Smith#May 4, 1994

2 swimming widgets:203.43

^D

$

The following report is written to the report file:
*************   Consolidated Widgets, Inc.    *************

John Doe                                       Feb 11, 1994

          1 flying widget                            171.42

          1 crawling widget                           89.99

Total:                                               261.41



Mary Smith                                      May 4, 1994

          2 swimming widgets                         203.43

Total:                                               203.43



Grand total:                                         464.84


This program starts off by reading the company name from the standard input file and then writing it out. Line 5 writes the company name using the print format COMPANY, which uses a centered output field to display the company name in the center of the line.
After the company name has been printed, the program starts processing data for one customer at a time. Each customer record is assumed to consist of a customer name and date followed by lines of orders. The customer name record uses a # character as the field separator, and the order records use : characters as the separator; this enables the program to distinguish one type of record from the other.
Line 13 prints the customer information using the CUSTOMER print format. This format contains two fields: a left-justified output field for the customer name, and a right-justified output field for the date of the transaction.
Line 22 prints an order line using the ORDERLINE print format. This print format also contains two fields: a left-justified output field indicating the item ordered, and a numeric field to display the cost of the item.
The value field format @####.## indicates that the cost is to be displayed as a floating-point number. This number is defined as containing at most five digits before the decimal point, and two digits after.
Finally, the print format TOTAL prints the customer total and the grand total. Because this print format is used inside a subroutine, the same print format can be used to print both totals.


Normally, any floating-point number you print is rounded up when necessary. For example, when you print 43.999 in the value field @#.##, it appears as 44.00.
However, a floating-point number whose last decimal place is 5 might or might not round correctly. For example, if you are writing using the value field @#.##, some numbers whose third and last decimal place is 5 will round and others will not. This happens because some floating-point numbers cannot be stored exactly, and the nearest equivalent number that can be stored is a slightly smaller number (which rounds down, not up)

Printing Value-Field Characters

As you have seen, certain characters such as @, <, and > are treated as value fields when they are encountered in print formats. Listing 11.5 shows how to actually print one of these special characters using write.



Listing 11.5. A program that prints a value-field character.
1:  #!/usr/local/bin/perl

2:

3:  format SPECIAL =

4:  This line contains the special character @.

5:  "@"

6:  .

7:

8:  $~ = "SPECIAL";

9:  write;



$ program11_5

This line contains the special character @.

$


The print format line in line 4 contains the special character @, which is a one-character value field. Line 5 specifies that the string @ is to be displayed in this value field when the line is printed.

Using the Multiline Field Format

Listing 11.6 uses the multiline field format @* to write a character string over several lines.



Listing 11.6. A program that writes a string using the multiline field format.
1:  #!/usr/local/bin/perl

2:  

3:  @input = <STDIN>;

4:  $string = join("", @input);

5:  $~ = "MULTILINE";

6:  write;

7:  

8:  format MULTILINE =

9:  ****** contents of the input file: ******

10: @*

11: $string

12: *****************************************

13: .



$ program11_6

Here is a line of input.

Here is another line.

Here is the last line.

^D

****** contents of the input file: ******

Here is a line of input.

Here is another line.

Here is the last line.

*****************************************

$



Line 3 reads the entire input file into the array variable @input. Each element of the list stored in @input is one line of the input file.
Line 4 joins the input lines into a single character string, stored in $string. This character string still contains the newline characters that end each line.
Line 6 calls write using the print format MULTILINE. The @* value field in this print-format definition indicates that the value stored in $string is to be written out using as many lines as necessary. This ensures that the entire string stored in $string is written out.


If a character string contains a newline character, the only way to display the entire string using write is to use the @* multiline value field. If you use any other value field, only the part of the string preceding the first newline character is displayed

Writing to Other Output Files

So far, all of the examples that have used the function write have written to the standard output file. However, you can use write also to send output to other files.
The simplest way to do this is to pass the file to write to as an argument to write. For example, to write to the file represented by the file variable MYFILE using the print format MYFILE, you can use the following statement:
write (MYFILE);

Here, write writes to the file named MYFILE using the default print format, which is also MYFILE. This is tidy and efficient, but somewhat restricting because, in this case, you can't use $~ to choose the print format to use.
The $~ system variable only works with the default file variable, which is the file variable to which write sends output. To change the default file variable, and therefore change the file that $~ affects, call the built-in function select, as follows:
select (MYFILE);

select sets the default file variable to use when writing. For example, to write to the file represented by the file variable MYFILE using the print format MYFORMAT, you can use the following statements:
select(MYFILE);

$~ = "MYFORMAT";

write;

Here, the built-in function select indicates that the file to be written to is the file represented by the file variable MYFILE. The statement
$~ = "MYFORMAT";

selects the print format to be associated with this particular file handle; in this case, the print format MYFORMAT is now associated with the file variable MYFILE.


NOTE
This is worth repeating: Each file variable has its own current print format. An assignment to $~ only changes the print format for the current file variable (the last one passed to select)

Because select has changed the file to be written to, the call to write no longer writes to the standard output file. Instead, it writes to MYFILE. Calls to write continue to write to MYFILE until the following statement is seen:
select(STDOUT);

This statement resets the write file to be the standard output file.


Changing the write file using select not only affects write; it also affects print. For example, consider the following:
select (MYFILE);
print ("Here is a line of text.\n");
This call to print writes to MYFILE, not to the standard output file. As with write, calls to print continue to write to MYFILE until another call to select is seen

The select function is useful if you want to be able to use the same subroutine to write to more than one file at a time. Listing 11.7 is an example of a simple program that does this.



Listing 11.7. A program that uses the select function.
1:  #!/usr/local/bin/perl

2:  

3:  open (FILE1, ">file1");

4:  $string = "junk";

5:  select (FILE1);

6:  &writeline;

7:  select (STDOUT);

8:  &writeline;

9:  close (FILE1);

10: 

11: sub writeline {

12:         $~ = "WRITELINE";

13:         write;

14: }

15: 

16: format WRITELINE =

17:         I am writing @<<<<< to my output files.

18:                      $string

19: .



$ program11_7

       I am writing junk   to my output files.

$


Line 5 of this program calls select, which sets the default file variable to FILE1. Now, all calls to write or print write to FILE, not the standard output file.
Line 6 calls writeline to write a line. This subroutine sets the current print format for the default file variable to WRITELINE. This means that the file FILE1 now is using the print format WRITELINE, and, therefore, the subroutine writes the following line to the file FILE1 (which is file1):
I am writing junk   to my output files.

Line 7 sets the default file variable back to the standard output file variable, STDOUT. This means that write and print now send output to the standard output file. Note that the current print format for the standard output file is STDOUT (the default), not WRITELINE; the assignment to $~ in the subroutine WRITELINE affects only FILE1, not STDOUT.
Line 8 calls writeline again; this time, the subroutine writes a line to the standard output file. The assignment
$~ = "WRITELINE";

in line 12 associates the print format WRITELINE with the standard output file. This means that WRITELINE is now associated with both STDOUT and FILE1.
At this point, the call to write in line 13 writes the line of output that you see on the standard output file.


DO, whenever possible, call select and assign to $~ immediately before calling write, as follows:
select (MYFILE);
$~ = "MYFORMAT";
write;
Keeping these statements together makes it clear which file is being written to and which print format is being used.
DON'T use select and $~ indiscriminately, because you might lose track of which print format goes with which file variable, and you might forget which file variable is the default for printing

Saving the Default File Variable

When select changes the default file variable, it returns an internal representation of the file variable that was last selected. For example:
$oldfile = select(NEWFILE);

This call to select is setting the current file variable to NEWFILE. The old file variable is now stored in $oldfile. To restore the previous default file variable, you can call select as follows:
select ($oldfile);

At this point, the default file variable reverts back to its original value (what it was before NEWFILE was selected).


The internal representation of the file variable returned by select is not necessarily the name of the file variable

You can use the return value from select to create subroutines that write to the file you want to write with, using the print format you want to use, without affecting the rest of the program. For example:
sub write_to_stdout {

        local ($savefile, $saveformat);

        $savefile = select(STDOUT);

        $saveformat = $~;

        $~ = "MYFORMAT";

        write;

        $~ = $saveformat;

        select($savefile);

}

This subroutine calls select to set the default output file to STDOUT, the standard output file. The return value from select, the previous default file, is saved in $savefile.
Now that the default output file is STDOUT, the next step is to save the current print format being used to write to STDOUT. The subroutine does this by saving the present value of $~ in another local variable, $saveformat. After this is saved, the subroutine can set the current print format to MYFORMAT. The call to write now writes to the standard output file using MYFORMAT.
After the call to write is complete, the subroutine puts things back the way they were. The first step is to reset $~ to the value stored in $saveformat. The final step is to set the default output file back to the file variable whose representation is saved in $savefile.
Note that the call to select must appear after the assignment to $~. If the call to select had been first, the assignment to $~ would change the print format associated with the original default file variable, not STDOUT.
As you can see, this subroutine doesn't need to know what the default values outside the subroutine are. Also, it does not affect the default values outside the subroutine.

Specifying a Page Header

If you are sending your output to a printer, you can make your output look smarter by supplying text to appear at the top of every page in your output. This special text is called a page header.
If a page header is defined for a particular output file, write automatically paginates the output to that file. When the number of lines printed is greater than the length of a page, write starts a new page.
To define a page header for a file, create a print format definition with the name of filename_TOP, where filename is a placeholder for the name of the file variable corresponding to the file to which you are writing. For example, to define a header for writing to standard output, define a print format named STDOUT_TOP, as follows:
format STDOUT_TOP =

Consolidated Widgets Inc. 1994 Annual Report

.

In this case, when the Perl interpreter starts a new page of standard output, the contents of the print format STDOUT_TOP are printed automatically.
Print formats that generate headers can contain value fields which are replaced by scalar values, just like any other print format. One particular value that is often used in page headers is the current page number, which is stored in the system variable $%. For example:
format STDOUT_TOP =

Page @<<.

$%

.

In this case, when the first page is printed, the program prints the following header at the top of the page:
Page 1.



NOTE
By default, $% is initially set to zero and is incremented every time a new page begins.
To change the pagination, change the value of $% before (or during) printing

Changing the Header Print Format

To change the name of the print format that prints a page header for a particular file, change the value stored in the special system variable $^.
As with $~, only the value for the current default file can be changed. For example, to use the print format MYHEADER as the header file for the file MYFILE, add the following statements:
$oldfile = select(MYFILE);

$^ = "MYHEADER";

select($oldfile);

.

These statements set MYFILE to be the current default file, change the header for MYFILE to be the print format MYHEADER, and then reset the current default file to its original value.

Setting the Page Length

By default, the page length is 60 lines. To specify a different page length, change the value stored in the system variable $=:
$= = 66;     # set the page length to 66 lines

This assignment must appear before the first write statement.


If the page length is changed in the middle of the program, the new page length will not be used until a new page is started

Listing 11.8 shows how you can set the page length and define a page-header print format for your output file.



Listing 11.8. A program that sets the length and print format for a page.
1:  #!/usr/local/bin/perl

2:  

3:  open (OUTFILE, ">file1");

4:  select (OUTFILE);

5:  $~ = "WRITELINE";

6:  $^ = "TOP_OF_PAGE";

7:  $= = 60;

8:  while ($line = <STDIN>) {

9:          write;

10: }

11: close (OUTFILE);

12: 

13: format TOP_OF_PAGE =

14:                                     - page @<

15:                                              $%

16: .

17: format WRITELINE =

18: @>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

19: $line

20: .



Suppose that you supply the following input:
$ program11_8

Here is a line of input.

Here is another line.

Here is the last line.

^D

$

The following output is written to the file file1:
- page 1

                 Here is a line of input.

               Here is another line.

              Here is the last line.



Line 3 opens the file file1 for output and associates it with the file variable OUTFILE.
Line 4 sets the current default file to OUTFILE. Now, when write or print is called with no file variable supplied, the output is sent to OUTFILE.
Line 5 indicates that WRITELINE is the print format to be used when writing to the file OUTFILE. To do this, it assigns WRITELINE to the system variable $~. This assignment does not affect the page header.
Line 6 indicates that TOP_OF_PAGE is the print format to be used when printing the page headers for the file OUTFILE. This assignment does not affect the print format used to write to the body of the page.
Line 7 sets the page length to 60 lines. This page length takes effect immediately, because no output has been written to OUTFILE.

Using print with Pagination

Normally, you won't want to use print if you are using pagination, because the Perl interpreter keeps track of the current line number on the page by monitoring the calls to write. If you must use a call to print in your program and you want to ensure that the page counter includes the call in its line count, adjust the system variable $-. This system variable indicates the number of lines between the current line and the bottom of the page. When $- reaches 0, a top-of-form character is generated, which starts a new page.
The following is a code fragment that calls print and then adjusts the $- variable:
print ("Here is a line of output\n");

$- -= 1;

When $- has 1 subtracted from its value, the page counter becomes correct.

Formatting Long Character Strings

As you've seen, the @* value field prints multiple lines of text. However, this field prints the output exactly as it is stored in the character string. For example, consider Listing 11.9, which uses @* to write a multiline character string.



Listing 11.9. A program that illustrates the limitations of the @* value field.
1:  #!/usr/local/bin/perl

2:  

3:  $string = "Here\nis an unbalanced line of\ntext.\n";

4:  $~ = "OUTLINE";

5:  write;

6:  

7:  format OUTLINE =

8:  @*

9:  $string

10: .



$ program11_9

Here

is an unbalanced line of

text.

$



This call to write displays the character string stored in $string exactly as is. Perl enables you to define value fields in print-format definitions that format text. To do this, replace the initial @ character in the value field with a ^ character. When text formatting is specified, the Perl interpreter tries to fit as many words as possible into the output line.
Listing 11.10 is an example of a simple program that does this.



Listing 11.10. A program that uses a value field that does formatting.
1:  #!/usr/local/bin/perl

2:  

3:  $string = "Here\nis an unbalanced line of\ntext.\n";

4:  $~ = "OUTLINE";

5:  write;

6:  

7:  format OUTLINE =

8:  ^<<<<<<<<<<<<<<<<<<<<<<<<<<<

9:  $string

10: .



$ program11_10

Here is an unbalanced line

$



Line 5 calls write using the print format OUTLINE. This print format contains a value field that specifies that formatting is to take place; this means that the Perl interpreter tries to fit as many words as possible into the line of output. In this case, the first line Here and the four-word string is an unbalanced line fit into the output line.
Note that there are two characters left over in the output line after the four words have been filled in. These characters are not filled, because the next word is not short enough to fit into the space remaining. Only entire words are filled.
One other feature of the line-filling operation is that the substring printed out is actually deleted from the scalar variable $string. This means that the value of $string is now of\ntext.\n. This happens because subsequent lines of output in the same print-format definition can be used to print the rest of the string.


NOTE
Because the line-filling write operation updates the value used, the value must be contained in a scalar variable and cannot be the result of an expression

To see how multiple lines of formatted output work, look at Listing 11.11. This program reads a quotation from the standard input file and writes it out on three formatted lines of output.



Listing 11.11. A program that writes out multiple formatted lines of output.
1:  #!/usr/local/bin/perl

2:  

3:  @quotation = <STDIN>;

4:  $quotation = join("", @quotation);

5:  $~ = "QUOTATION";

6:  write;

7:  

8:  format QUOTATION =

9:  Quotation for the day:

10: -----------------------------

11:    ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

12:    $quotation

13:    ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

14:    $quotation

15:    ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

16:    $quotation

17: -----------------------------

18: .



$ program11_11

Any sufficiently advanced programming

language is indistinguishable from magic.

^D

Quotation for the day:

-----------------------------

   Any sufficiently advanced programming language is   

   indistinguishable from magic.                       



-----------------------------

$



The print format QUOTATION defines three value fields on which formatting is to be employed. Each of the three value fields uses the value of the scalar variable $quotation.
Before write is called, $quotation contains the entire quotation with newline characters appearing at the end of each input line. When write is called, the first value field in the print format uses as much of the quotation as possible. This means that the following substring is written to the standard output file:
Any sufficiently advanced programming language is

After the substring is written, it is removed from $quotation, which now contains the following:
indistinguishable from magic.

Because the written substring has been removed from $quotation, the remainder of the string can be used in subsequent output lines. Because the next value field in the print format also wants to use $quotation, the remainder of the string appears on the second output line and is deleted. $quotation is now the empty string.
This means that the third value field, which also refers to $quotation, is replaced by the empty string, and a blank line is written out.


The scalar variable containing the output to be printed is changed by a write operation. If you need to preserve the information, copy it to another scalar variable before calling write

Eliminating Blank Lines When Formatting

You can eliminate blank lines such as the one generated by Listing 11.11. To do this, put a ~ character at the beginning of any output line that is to be printed only when needed.
Listing 11.12 modifies the quotation-printing program to print lines only when they are not blank.



Listing 11.12. A program that writes out multiple formatted lines of output and suppresses blank lines.
1:  #!/usr/local/bin/perl

2:  

3:  @quotation = <STDIN>;

4:  $quotation = join("", @quotation);

5:  $~ = "QUOTATION";

6:  write;

7:  

8:  format QUOTATION =

9:  Quotation for the day:

10: -----------------------------

11: ~  ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

12:    $quotation

13: ~  ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

14:    $quotation

15: ~  ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

16:    $quotation

17: -----------------------------

18: .



$ program11_12

Any sufficiently advanced programming

language is indistinguishable from magic.

^D

Quotation for the day:

-----------------------------

   Any sufficiently advanced programming language is   

   indistinguishable from magic.                       

-----------------------------

$


If the quotation is too short to require all the lines, remaining lines are left blank. In this case, the quotation requires only two lines of output, so the third isn't printed.
The program is identical to the one in Listing 11.11 in all other respects. In particular, the value of $quotation after the call to write is still the empty string.

Supplying an Indefinite Number of Lines

While Listing 11.12 suppresses blank lines, it imposes an upper limit of three lines. Quotations longer than three lines are not printed in their entirety. To indicate that the formatted output is to use as many lines as necessary, specify two ~ characters at the beginning of the output line containing the value field. Listing 11.13 modifies the quotation program to allow quotations of any length.



Listing 11.13. A program that writes out as many formatted lines of output as necessary.
1:  #!/usr/local/bin/perl

2:  

3:  @quotation = <STDIN>;

4:  $quotation = join("", @quotation);

5:  $~ = "QUOTATION";

6:  write;

7:  

8:  format QUOTATION =

9:  Quotation for the day:

10: -----------------------------

11: ~~ ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

12:    $quotation

13: -----------------------------

14: .



$ program11_13

Any sufficiently advanced programming

language is indistinguishable from magic.

^D

Quotation for the day:

----------------------------

   Any sufficiently advanced programming language is   

   indistinguishable from magic.                       

-----------------------------

$



The ~~ characters at the beginning of the output field indicate that multiple copies of the output line are to be supplied. The output line is to be printed until there is nothing more to print.
In Listing 11.13, two copies of the line are needed.

Formatting Output Using printf

If you want to write output that looks reasonable without going to all the trouble of using write and print formats, Perl provides a built-in function, printf, that prints formatted output.


NOTE
If you are familiar with the C programming language, the behavior of printf in Perl will be familiar; the Perl printf and the C printf are basically the same

The arguments passed to the printf function are as follows:
  • The string to be printed, which can contain one or more field specifiers
  • One value for each field specifier appearing in the string to be printed
When printf sees a field specifier, it substitutes the corresponding value in the printf argument list. The representation of the substituted value in the string depends on the field specifier that is supplied.
Field specifiers consist of the % character followed by a single character that represents the format to use when printing. Table 11.2 lists the field-specifier formats and the field-specifier character that represents each.


Table 11.2. Field specifiers for printf.

Specifier
Description
%c
Single character
%d
Integer in decimal (base-10) format
%e
Floating-point number in scientific notation
%f
Floating-point number in "normal" (fixed-point) notation
%g
Floating-point number in compact format
%o
Integer in octal (base-8) format
%s
Character string
%u
Unsigned integer
%x
Integer in hexadecimal (base-16) format

Here is a simple example of a call to printf:
printf("The number I want to print is %d.\n", $number);

The string to be printed contains one field specifier, %d, which represents an integer. The value stored in $number is substituted for the field specifier and printed.
Field specifiers also support a variety of options, as follows:
  • If you are printing an integer using the d, o, u, or x format, you can put an l character in front of the field-specifier character (as in, for example, %ld). This character specifies that the number is a decimal integer in the machine's "long integer" format (corresponding to the C type long). This is useful if your integer is large or might be.
  • A positive integer following the % character indicates the minimum width of the field. For example, %20s prints a character string in a field of 20 characters. If the string is not large enough to fill the entire field, it is right justified (placed at the right end of the field) and padded with blanks. (If the integer starts with a leading 0, as in %08d, the field is padded with zeros, not blanks.)
  • A negative integer following the % character indicates the width of the field and requests left justification. For example, %-15s prints a character string in a field of 15 characters, and it fills the right end of the field with blanks if the string is not large enough.
  • If you are using a field specifier that prints a floating-point number (%e, %f, or %g), you can specify the number of digits that are to appear after the decimal point. To do this, specify a floating-point number after the % character. For example:
    %8.3f
  • Here, the number preceding the decimal point is the field width (as before), and the number after the decimal point is the number of decimal places to print.


If a floating-point number contains more digits than the field specifier wants, the number is rounded to the number of decimal places needed. For example, if 43.499 is being printed using the field %5.2f, the number actually printed is 43.50.
As with the write value field @##.##, printf might not always round up when it is handling numbers whose last decimal place is 5. This happens because some floating-point numbers cannot be stored exactly, and the nearest equivalent number that can be stored is a slightly smaller number (which rounds down, not up). For example, 43.495 when printed by %5.2f might print 43.49, depending on how 43.495 is stored

  • If you are using a field specifier that prints an integer, character, or string, supplying a floating-point number after the % character specifies the maximum length of the value to be printed. In the following example a character string is printed in a 15-character field, but the string itself can be at most 10 characters long:
    %15.10s
  • This guarantees that at least five spaces will appear in the printed line.


NOTE
You can use printf to print to other files. To do this, specify the file variable corresponding to the file to which you want to print, just as you would with print or write
printf MYFILE ("I am printing %d.\n", $value);
This means that changing the current default file using select affects printf.

Summary

Perl enables you to format your output using print-format definitions and the built-in function write. In print-format definitions, you can specify value fields that are to be replaced by either the contents of scalar variables or the values of expressions.
Value fields indicate how to print the contents of a scalar variable or the value of an expression. With a value field, you can specify that the value is to be left justified (blanks added on the right), right justified (blanks added on the left), centered, or displayed as a floating-point number.
You also can define value fields that format a multiline character string. Blank lines can be suppressed, and the field can be defined to use as many output lines as necessary.
The built-in function select enables you to change the default file to which write and print send output.
You can break your output into pages by defining a special header print format that prints header information at the top of each page.
The following system variables enable you to control how write sends output to a file:
  • The system variable $~ contains the name of the print format being used by the current default file.
  • The system variable $^ contains the name of the print format being used as a page header by the current default file.
  • The system variable $= contains the number of lines per printed page.
  • The system variable $- contains the number of lines left on the current page.
The built-in function printf enables you to format an individual line of text using format specifiers.

Q&A

Q:Which is better, write or printf?
A:It depends on what you want to do. If you want to print reports or control pagination, you'll need to use write. If you just want individual lines of output to look neat, printf might be what you need.
Q:How do I generate a page break?
A:To do this, set $- to zero. This generates a top-of-form character.
Q:Why do value fields that format text modify the contents of the scalar variable containing the text?
A:When formatted text is printed, the printed text is removed from the scalar variable, and the part of the string that is not printed is retained. This enables you to use other calls to write to print the remainder of the text. In fact, you can print the rest of the text in the scalar variable using a completely different print format.
Q:How many print formats can I define?
A:Basically, as many as you like, provided the resulting Perl program can still fit in your machine.

Workshop

The Workshop provides quiz questions to help you solidify your understanding of the material covered and exercises to give you experience in using what you've learned. Try to understand the quiz and exercise answers before you go on to tomorrow's lesson.

Quiz

  1. Define value fields that print the following:
    a.    Ten left-justified characters
    b.    Five right-justified characters
    c.    Two centered characters
    d.    A floating-point number with five digits before the decimal point and three after it
    e.    A field that prints as many formatted lines of 30 left-justified characters as necessary
  2. What do these fields print?
    a.    @<<<<
    b
    .    @||||||
    c
    .    @
    d
    .    @*
    e
    .    ~ ^>>>>>>>>>
  3. What do these printf field specifiers print?
    a.    %5d
    b
    .    %11.4f
    c
    .    %010d
    d
    .    %-12s
    e
    .    %x
  4. Why do certain floating-point numbers have round-off problems?
  5. How do you create a page header for an output file?

Exercises

  1. Write a program that prints the powers of 2 from 2**1 to 2**10. Use write and a print format to print them three to a line. Align the lines so that the right end of each number is lined up with the right end of the corresponding number on the previous line.
  2. Repeat Exercise 1 using printf.
  3. Write a program that reads text and formats it into 40-character lines, left-justified. Put lines of asterisks above and below the text.
  4. Write a program that reads a set of dollar values such as 71.43 (one per line). Write out two values per line (the first and second on the first line, and so on). Total each of the resulting columns, and produce a grand total.
  5. BUG BUSTER: What is wrong with the following program?
    #!/usr/local/bin/perl

    format STDOUT =
    @*
    .
    while ($line = <STDIN>) {
    chop ($line);
    if ($line eq "") {
    print ("<blank line>\n");
    next;
    }
    write;
    }

  

Chapter 10:Associative Arrays


Chapter 10

Associative Arrays

  Today's lesson shows you how to use associative arrays. You'll learn the following:
  • What an associative array is
  • How to access and create an associative array
  • How to copy to and from an associative array
  • How to add and delete associative array elements
  • How to list array indexes and values
  • How to loop using an associative array
  • How to build data structures using associative arrays
To start, take a look at some of the problems that using array variables creates. Once you have seen some of the difficulties created by array variables in certain contexts, you'll see how associative arrays can eliminate these difficulties.

Limitations of Array Variables

In the array variables you've seen so far, you can access an element of a stored list by specifying a subscript. For example, the following statement accesses the third element of the list stored in the array variable @array:
$scalar = $array[2];

The subscript 2 indicates that the third element of the array is to be referenced.
Although array variables are useful, they have one significant drawback: it's often difficult to remember which element of an array stores what. For example, suppose you want to write a program that counts the number of occurrences of each capitalized word in an input file. You can do this using array variables, but it's very difficult. Listing 10.1 shows you what you have to go through to do this.


Listing 10.1. A program that uses array variables to keep track of capitalized words in an input file.
1:  #!/usr/local/bin/perl

2:

3:  while ($inputline = <STDIN>) {

4:          while ($inputline =~ /\b[A-Z]\S+/g) {

5:                  $word = $&;

6:                  $word =~ s/[;.,:-]$//;  # remove punctuation

7:                  for ($count = 1; $count <= @wordlist;

8:                                   $count++) {

9:                          $found = 0;

10:                         if ($wordlist[$count-1] eq $word) {

11:                                 $found = 1;

12:                                 $wordcount[$count-1] += 1;

13:                                 last;

14:                         }

15:                 }

16:                 if ($found == 0) {

17:                         $oldlength = @wordlist;

18:                         $wordlist[$oldlength] = $word;

19:                         $wordcount[$oldlength] = 1;

20:                 }

21:         }

22: }

23: print ("Capitalized words and number of occurrences:\n");

24: for ($count = 1; $count <= @wordlist; $count++) {

25:         print ("$wordlist[$count-1]: $wordcount[$count-1]\n");

26: }



$ program10_1

Here is a line of Input.

This Input contains some Capitalized words.

^D

Capitalized words and number of occurrences:

Here: 1

Input: 2

This: 1

Capitalized: 1

$



This program reads one line of input at a time from the standard input file. The loop starting on line 4 matches each capitalized word in the line; the loop iterates once for each match, and it assigns the match being examined in this particular iteration to the scalar variable $word.
Once any closing punctuation has been removed by line 6, the program must then check whether this word has been seen before. Lines 7-15 do this by examining each element of the list @wordlist in turn. If an element of @wordlist is identical to the word stored in $word, the corresponding element of @wordcount is incremented.
If no element of @wordlist matches $word, lines 16-20 add a new element to @wordlist and @wordcount.

Definition

As you can see, using array variables creates several problems. First, it's not obvious which element of @wordlist in Listing 10.1 corresponds to which capitalized word. In the example shown, $wordlist[0] contains Here because this is the first capitalized word in the input file, but this is not obvious to the reader.
Worse still, the program has no way of knowing which element of @wordlist contains which word. This means that every time the program reads a new word, it has to check the entire list to see if the word has already been found. This becomes time-consuming as the list grows larger.
All of these problems with array variables exist because elements of array variables are accessed by numeric subscripts. To get around these problems, Perl defines another kind of array, which enables you to access array variables using any scalar value you like. These arrays are called associative arrays.
To distinguish an associative array variable from an ordinary array variable, Perl uses the % character as the first character of an associative array-variable name, instead of the @ character. As with other variable names, the first character following the % must be a letter, and subsequent characters can be letters, digits, or underscores.
The following are examples of associative array-variable names:
%assocarray

%a1

%my_really_long_but_legal_array_variable_name



NOTE
Use the same name for an associative array variable and an ordinary array variable. For example, you can define an array variable named @arrayname and an associative array variable named %arrayname.
The @ and % characters ensure that the Perl interpreter can tell one variable name from another.

Referring to Associative Array Elements

The main difference between associative arrays and ordinary arrays is that associative array subscripts can be any scalar value. For example, the following statement refers to an element of the associative array %fruit:
$fruit{"bananas"} = 1;

The subscript for this array element is bananas. Any scalar value can be a subscript. For example:
$fruit{"black_currant"}

$number{3.14159}

$integer{-7}

A scalar variable can be used as a subscript, as follows:
$fruit{$my_fruit}

Here, the contents of $my_fruit become the subscript into the associative array %fruit.
When an array element is referenced, as in the previous example, the name of the array element is preceded by a $ character, not the % character. As with array variables, this tells the Perl interpreter that this is a single scalar item and is to be treated as such.


NOTE
Subscripts for associative array elements are always enclosed in brace brackets ({}), not square brackets ([]). This ensures that the Perl interpreter is always able to distinguish associative array elements from other array elements.

Adding Elements to an Associative Array

The easiest way to create an associative array item is just to assign to it. For example, the statement
$fruit{"bananas"} = 1;

assigns 1 to the element bananas of the associative array %fruit. If this element does not exist, it is created. If the array %fruit has not been referred to before, it also is created.
This feature makes it easy to use associative arrays to count occurrences of items. For example, Listing 10.2 shows how you can use associative arrays to count the number of capitalized words in an input file. Note how much simpler this program is than the one in Listing 10.1, which accomplishes the same task.


Listing 10.2. A program that uses an associative array to count the number of capitalized words in a file.
1:  #!/usr/local/bin/perl

2:

3:  while ($inputline = <STDIN>) {

4:          while ($inputline =~ /\b[A-Z]\S+/g) {

5:                  $word = $&;

6:                  $word =~ s/[;.,:-]$//;  # remove punctuation

7:                  $wordlist{$word} += 1;

8:          }

9:  }

10: print ("Capitalized words and number of occurrences:\n");

11: foreach $capword (keys(%wordlist)) {

12:         print ("$capword: $wordlist{$capword}\n");

13: }



$ program10_2

Here is a line of Input.

This Input contains some Capitalized words.

^D

Capitalized words and number of occurrences:

This: 1

Input: 2

Here: 1

Capitalized: 1

$



As you can see, this program is much simpler than the one in Listing 10.1. The previous program required 20 lines of code to read input and store the counts for each word; this program requires only seven.
As before, this program reads one line of input at a time from the standard input file. The loop starting in line 4 iterates once for each capitalized word found in the input line; each match is assigned, in turn, to the scalar variable $word.
Line 7 uses the associative array %wordlist to keep track of the capitalized words. Because associative arrays can use any value as a subscript for an element, this line uses the word itself as a subscript. Then, the element of the array corresponding to the word has 1 added to its value.
For example, when the word Here is read in, the associative array element $wordlist{"Here"} has 1 added to its value.
Lines 11-13 print the elements of the associative array. Line 11 contains a call to a special built-in function, keys. This function returns a list consisting of the subscripts of the associative array; the foreach statement then loops through this list, iterating once for each element of the associative array. Each subscript of the associative array is assigned, in turn, to the local variable $capword; in this example, this means that $capword is assigned Here, Input, Capitalized, and This-one per each iteration of the for each loop.


An important fact to remember is that associative arrays always are stored in "random" order. (Actually, it's the order that ensures fastest access, but, effectively, it is random.) This means that if you use keys to access all of the elements of an associative array, there is no guarantee that the elements will appear in any given order. In particular, the elements do not always appear in the order in which they are created.

To control the order in which the associative array elements appear, use sort to sort the elements returned by keys.
foreach $capword (sort keys(%wordlist)) {

        print ("$capword: $wordlist{$capword}\n");

}

When line 10 of Listing 10.2 is modified to include a call to sort, the associative array elements appear in sorted order.

Creating Associative Arrays

You can create an associative array with a single assignment. To do this, alternate the array subscripts and their values. For example:
%fruit = ("apples", 17, "bananas", 9, "oranges", "none");

This assignment creates an associative array of three elements:
  • An element with subscript apples, whose value is 17
  • An element with subscript bananas, whose value is 9
  • An element with subscript oranges, whose value is none


Again, it is important to remember that the elements of associative arrays are not guaranteed to be in any particular order, even if you create the entire array at once.


NOTE
Perl version 5 enables you to use either => or , to separate array subscripts and values when you assign a list to an associative array. For example:
%fruit = ("apples" => 17, "bananas" => 9, "oranges" => "none");
This statement is identical to the previous one, but is easier to understand; the use of => makes it easier to see which subscript is associated with which value.

As with any associative array, you always can add more elements to the array later on. For example:
$fruit{"cherries"} = 5;

This adds a fourth element, cherries, to the associative array %fruit, and gives it the value 5.

Copying Associative Arrays from Array Variables

The list of subscripts and values assigned to %fruit in the previous example is an ordinary list like any other. This means that you can create an associative array from the contents of an array variable. For example:
@fruit = ("apples", 6, "cherries", 8, "oranges", 11);

%fruit = @fruit;

The second statement creates an associative array of three elements-apples, cherries, and oranges-and assigns it to %fruit.


If you are assigning a list or the contents of an array variable to an associative array, make sure that the list contains an even number of elements, because each pair of elements corresponds to the subscript and the value of an associative array element.

Similarly, you can copy one associative array into another. For example:
%fruit1 = ("apples", 6, "cherries", 8, "oranges", 11);

%fruit2 = %fruit1;

You can assign an associative array to an ordinary array variable in the same way. For example:
%fruit = ("grapes", 11, "lemons", 27);

@fruit = %fruit;

However, this might not be as useful, because the order of the array elements is not defined. Here, the array variable @fruit is assigned either the four-element list
("grapes", 11, "lemons", 27)

or the list
("lemons", 27, "grapes", 11)

depending on how the associative array is sorted.
You can also assign to several scalar variables and an associative array at the same time.
($var1, $var2, %myarray) = @list;

Here, the first element of @list is assigned to $var1, the second to $var2, and the rest to %myarray.
Finally, an associative array can be created from the return value of a built-in function or user-defined subroutine that returns a list. Listing 10.3 is an example of a simple program that does just that. It takes the return value from split, which is a list, and assigns it to an associative array variable.


Listing 10.3. A program that uses the return value from a built-in function to create an associative array.
1:  #!/usr/local/bin/perl

2:

3:  $inputline = <STDIN>;

4:  $inputline =~ s/^\s+|\s+\n$//g;

5:  %fruit = split(/\s+/, $inputline);

6:  print ("Number of bananas: $fruit{\"bananas\"}\n");



$ program10_3

oranges 5 apples 7 bananas 11 cherries 6

Number of bananas: 11

$



This program reads a line of input from the standard input file and eliminates the leading and trailing white space. Line 5 then calls split, which breaks the line into words. In this example, split returns the following list:
("oranges", 5, "apples", 7, "bananas", 11, "cherries", 6)

This list is then assigned to the associative array %fruit. This assignment creates an associative array with four elements:

ElementValue
oranges5
apples7
bananas11
cherries6

Line 6 then prints the value of the element bananas, which is 11.

Adding and Deleting Array Elements

As you've seen, you can add an element to an associative array by assigning to an element not previously seen, as follows:
$fruit{"lime"} = 1;

This statement creates a new element of %fruit with index lime and gives it the value 1.
To delete an element, use the built-in function delete. For example, the following statement deletes the element orange from the array %fruit:
delete($fruit{"orange"});



DO use the delete function to delete an element of an associative array; it's the only way to delete elements.
DON'T use the built-in functions push, pop, shift, or splice with associative arrays because the position of any particular element in the array is not guaranteed.

Listing Array Indexes and Values

As you saw in Listing 10.2, the keys function retrieves a list of the subscripts used in an associative array. The following is an example:
%fruit = ("apples", 9,

           "bananas", 23,

           "cherries", 11);

@fruitsubs = keys(%fruits);

Here, @fruitsubs is assigned the list consisting of the elements apples, bananas, and cherries. Note once again that this list is in no particular order. To retrieve the list in alphabetical order, use sort on the list.
@fruitindexes = sort keys(%fruits));

This produces the list ("apples", "bananas", "cherries").
To retrieve a list of the values stored in an associative array, use the built-in function values. The following is an example:
%fruit = ("apples", 9,

           "bananas", 23,

           "cherries", 11);

@fruitvalues = values(%fruits);

Here, @fruitvalues contains the list (9, 23, 11), not necessarily in this order.

Looping Using an Associative Array

As you've seen, you can use the built-in function keys with the foreach statement to loop through an associative array. The following is an example:
%records = ("Maris", 61, "Aaron", 755, "Young", 511);

foreach $holder (keys(%records)) {

        # stuff goes here

}

The variable $holder is assigned Aaron, Maris, and Young on successive iterations of the loop (although not necessarily in that order).
This method of looping is useful, but it is inefficient. To retrieve the value associated with a subscript, the program must look it up in the array again, as follows:
foreach $holder (keys(%records)) {

        $record = %records{$holder};

}

Perl provides a more efficient way to work with associative array subscripts and their values, using the built-in function each, as follows:
%records = ("Maris", 61, "Aaron", 755, "Young", 511);

while (($holder, $record) = each(%records)) {

        # stuff goes here

}

Every time the each function is called, it returns a two-element list. The first element of the list is the subscript for a particular element of the associative array. The second element is the value associated with that particular subscript.
For example, the first time each is called in the preceding code fragment, the pair of scalar variables ($holder, $record) is assigned one of the lists ("Maris", 61), ("Aaron", 755), or ("Young", 511). (Because associative arrays are not stored in any particular order, any of these lists could be assigned first.) If ("Maris", 61) is returned by the first call to each, Maris is assigned to $holder and 61 is assigned to $record.
When each is called again, it assigns a different list to the pair of scalar variables specified. Subsequent calls to each assign further lists, and so on until the associative array is exhausted. When there are no more elements left in the associative array, each returns the empty list.


NOTE
Don't add a new element to an associative array or delete an element from it if you are using the each statement on it. For example, suppose you are looping through the associative array %records using the following loop:
while (($holder, $record) = each(%records)) {
# code goes here
}
Adding a new record to %records, such as
$records{"Rose"} = 4256;
or deleting a record, as in
delete $records{"Cobb"};
makes the behavior of each unpredictable. This should be avoided.

Creating Data Structures Using Associative Arrays

You can use associative arrays to simulate a wide variety of data structures found in high-level programming languages. This section describes how you can implement the following data structures in Perl using associative arrays:
  • Linked lists
  • Structures
  • Trees
  • Databases


NOTE
The remainder of today's lesson describes applications of associative arrays but does not introduce any new features of Perl. If you are not interested in applications of associative arrays, you can skip to the next chapter without suffering any loss of general instruction.

Linked Lists

A linked list is a simple data structure that enables you to store items in a particular order. Each element of the linked list contains two fields:
  • The value associated with this element
  • A reference, or pointer, to the next element in the list
Also, a special header variable points to the first element in the list.
Pictorially, a linked list can be represented as in Figure 10.1. As you can see, each element of the list points to the next.
Figure 10.1: A linked list.
In Perl, a linked list can easily be implemented using an associative array because the value of one associative array element can be the subscript for the next. For example, the following associative array is actually a linked list of words in alphabetical order:
%words = ("abel", "baker",

          "baker", "charlie",

          "charlie", "delta",

          "delta", "");

$header = "abel";

In this example, the scalar variable $header contains the first word in the list. This word, abel, is also the subscript of the first element of the associative array. The value of the first element of this array, baker, is the subscript for the second element, and so on, as illustrated in Figure 10.2.
Figure 10.2: A linked list of words in alphabetical order.
The value of the last element of the subscript, delta, is the null string. This indicates the end of the list.
Linked lists are most useful in applications where the amount of data to be processed is not known, or grows as the program is executed. Listing 10.4 is an example of one such application. It uses a linked list to print the words of a file in alphabetical order.


Listing 10.4. A program that uses an associative array to build a linked list.
1:  #!/usr/local/bin/perl

2:

3:  # initialize list to empty

4:  $header = "";

5:  while ($line = <STDIN>) {

6:          # remove leading and trailing spaces

7:          $line =~ s/^\s+|\s+$//g;

8:          @words = split(/\s+/, $line);

9:          foreach $word (@words) {

10:                 # remove closing punctuation, if any

11:                 $word =~ s/[.,;:-]$//;

12:                 # convert all words to lower case

13:                 $word =~ tr/A-Z/a-z/;

14:                 &add_word_to_list($word);

15:         }

16: }

17: &print_list;

18:

19: sub add_word_to_list {

20:         local($word) = @_;

21:         local($pointer);

22:

23:         # if list is empty, add first item

24:         if ($header eq "") {

25:                 $header = $word;

26:                 $wordlist{$word} = "";

27:                 return;

28:         }

29:         # if word identical to first element in list,

30:         # do nothing

31:         return if ($header eq $word);

32:         # see whether word should be the new

33:         # first word in the list

34:         if ($header gt $word) {

35:                 $wordlist{$word} = $header;

36:                 $header = $word;

37:                 return;

38:         }

39:         # find place where word belongs

40:         $pointer = $header;

41:         while ($wordlist{$pointer} ne "" &&

42:                 $wordlist{$pointer} lt $word) {

43:                 $pointer = $wordlist{$pointer};

44:         }

45:         # if word already seen, do nothing

46:         return if ($word eq $wordlist{$pointer});

47:         $wordlist{$word} = $wordlist{$pointer};

48:         $wordlist{$pointer} = $word;

49: }

50:

51: sub print_list {

52:         local ($pointer);

53:         print ("Words in this file:\n");

54:         $pointer = $header;

55:         while ($pointer ne "") {

56:                 print ("$pointer\n");

57:                 $pointer = $wordlist{$pointer};

58:         }

59: }



$ program10_4

Here are some words.

Here are more words.

Here are still more words.

^D

Words in this file:

are

here

more

some

still

words

$



The logic of this program is a little complicated, but don't despair. Once you understand how this works, you have all the information you need to build any data structure you like, no matter how complicated.
This program is divided into three parts, as follows:
  • The main program, which reads input and transforms it into the desired format
  • The subroutine add_word_to_list, which builds the linked list of sorted words
  • The subroutine print_list, which prints the list of words
Lines 3-17 contain the main program. Line 4 initializes the list of words by setting the header variable $header to the null string. The loop beginning in line 5 reads one line of input at a time. Line 7 removes leading and trailing spaces from the line, and line 8 splits the line into words.
The inner foreach loop in lines 9-15 processes one word of the input line at a time. If the final character of a word is a punctuation character, line 11 removes it; this ensures that, for example, word. (word with a period) is considered identical to word (without a period). Line 13 converts the word to all lowercase characters, and line 14 passes the word to the subroutine add_word_to_list.
This subroutine first executes line 24, which checks whether the linked list of words is empty. If it is, line 25 assigns this word to $header, and line 26 creates the first element of the list, which is stored in the associative array %wordlist. In this example, the first word read in is here (Here converted to lowercase), and the list looks like Figure 10.3.
Figure 10.3: The linked list with one element in it.
At this point, the header variable $header contains the value here, which is also the subscript for the element of %wordlist that has just been created. This means that the program can reference %wordlist by using $header as a subscript, as follows:
$wordlist{$header}

Variables such as $header that contain a reference to another data item are called pointers. Here, $header points to the first element of %wordlist.
If the list is not empty, line 31 checks whether the first item of the list is identical to the word currently being checked. To do this, it compares the current word to the contents of $header, which is the first item in the list. If the two are identical, there is no need to add the new word to the list, because it is already there; therefore, the subroutine returns without doing anything.
The next step is to check whether the new word should be the first word in the list, which is the case if the new word is alphabetically ahead of the existing first word. Line 34 checks this.
If the new word is to go first, the list is adjusted as follows:
  1. A new list element is created. The subscript of this element is the new word, and its value is the existing first word.
  2. The new word is assigned to the header variable.
To see how this adjustment works, consider the sample input provided. In this example, the second word to be processed is are. Because are belongs before here, the array element $wordlist{"are"} is created, and is given the value here. The header variable $header is assigned the value are. This means the list now looks like Figure 10.4.
Figure 10.4: The linked list with two elements in it.
The header variable $header now points to the list element with the subscript are, which is $wordlist{"are"}. The value of $wordlist{"are"} is here, which means that the program can access $wordlist{"here"} from $wordlist{"are"}. For example:
$reference = $wordlist{"are"};

print ("$wordlist{$reference}\n");

The value here is assigned to $reference, and print prints $wordlist{$reference}, which is $wordlist{"here"}.
Because you can access $wordlist{"here"} from $wordlist{"are"}, $wordlist{"are"} is a pointer to $wordlist{"here"}.
If the word does not belong at the front of the list, lines 40-44 search for the place in the list where the word does belong, using a local variable, $pointer. Lines 41-44 loop until the value stored in $wordlist{$pointer} is greater than or equal to $word. For example, Figure 10.5 illustrates where line 42 is true when the subroutine processes more.
Figure 10.5: The linked list when more is processed.
Note that because the list is in alphabetical order, the value stored in $pointer is always less than the value stored in $word.
If the word being added is greater than any word in the list, the conditional expression in line 41 eventually becomes true. This occurs, for example, when the subroutine processes some, as in Figure 10.6.
Figure 10.6: The linked list when some is processed.
Once the location of the new word has been determined, line 46 checks whether the word already is in the list. If it is, there is no need to do anything.
If the word does not exist, lines 47 and 48 add the word to the list. First, line 47 creates a new element of %wordlist, which is $wordlist{$word}; its value is the value of $wordlist{$pointer}. This means that $wordlist{$word} and $wordlist{$pointer} now point to the same word, as in Figure 10.7.
Figure 10.7: The linked list as a new word is being added.
Next, line 48 sets the value of $wordlist{$pointer} to the value stored in $word. This means that $wordlist{$pointer} now points to the new element, $wordlist{$word}, that was just created, as in Figure 10.8.
Figure 10.8: The linked list after the new word is added.
Once the input file has been completely processed, the subroutine print_list prints the list, one element at a time. The local variable $pointer contains the current value being printed, and $wordlist{$pointer} contains the next value to be printed.


NOTE
Normally, you won't want to use a linked list in a program. It's easier just to use sort and keys to loop through an associative array in alphabetical order, as follows:
foreach $word (sort keys(%wordlist)) {
# print the sorted list, or whatever
}
However, the basic idea of a pointer, which is introduced here, is useful in other data structures, such as trees, which are described later on.

Structures

Many programming languages enable you to define collections of data called structures. Like lists, structures are collections of values; each element of a structure, however, has its own name and can be accessed by that name.
Perl does not provide a way of defining structures directly. However, you can simulate a structure using an associative array. For example, suppose you want to simulate the following variable definition written in the C programming language:
struct {

        int field1;

        int field2;

        int field3;

} mystructvar;

This C statement defines a variable named mystructvar, which contains three elements, named field1, field2, and field3.
To simulate this using an associative array, all you need to do is define an associative array with three elements, and set the subscripts for these elements to field1, field2, and field3. The following is an example:
%mystructvar = ("field1", "",

             "field2", "",

             "field3", "");

Like the preceding C definition, this associative array, named %mystructvar, has three elements. The subscripts for these elements are field1, field2, and field3. The definition sets the initial values for these elements to the null string.
As with any associative array, you can reference or assign the value of an element by specifying its subscript, as follows:
$mystructvar{"field1"} = 17;

To define other variables that use the same "structure," all you need to do is create other arrays that use the same subscript names.

Trees

Another data structure that is often used in programs is a tree. A tree is similar to a linked list, except that each element of a tree points to more than one other element.
The simplest example of a tree is a binary tree. Each element of a binary tree, called a node, points to two other elements, called the left child and the right child. Each of these children points to two children of its own, and so on, as illustrated in Figure 10.9.
Figure 10.9: A binary tree.
Note that the tree, like a linked list, is a one-way structure. Nodes point to children, but children don't point to their parents.
The following terminology is used when describing trees:
  • Because each of the children of a node is a tree of its own, the left child and the right child are often called the left subtree and the right subtree of the node. (The terms left branch and right branch are also used.)
  • The "first" node of the tree (the node that is not a child of another node), is called the root of the tree.
  • Nodes that have no children are called leaf nodes.
There are several ways of implementing a tree structure using associative arrays. To illustrate one way of doing so, suppose that you wish to create a tree whose root has the value alpha and whose children have the values beta and gamma, as in Figure 10.10.
Figure 10.10: A binary tree with three nodes.
Here, the left child of alpha is beta, and the right child of alpha is gamma.
The problem to be solved is this: How can a program associate both beta and gamma with alpha? If the associative array that is to represent the tree is named %tree, do you assign the value of $tree{"alpha"} to be beta, or gamma, or both? How do you show that an element points to two other elements?
There are several solutions to this problem, but one of the most elegant is as follows: Append the character strings left and right, respectively, to the name of a node in order to retrieve its children. For example, define alphaleft to point to beta and alpharight to point to gamma. In this scheme, if beta has children, betaleft and betaright point to their locations; similarly, gammaleft and gammaright point to the locations of the children of gamma, and so on.
Listing 10.5 is an example of a program that creates a binary tree using this method and then traverses it (accesses every node in the tree).


Listing 10.5. A program that uses an associative array to represent a binary tree.
1:  #!/usr/local/bin/perl

2:

3:  $rootname = "parent";

4:  %tree = ("parentleft", "child1",

5:           "parentright", "child2",

6:           "child1left", "grandchild1",

7:           "child1right", "grandchild2",

8:           "child2left", "grandchild3",

9:           "child2right", "grandchild4");

10: # traverse tree, printing its elements

11: &print_tree($rootname);

12:

13: sub print_tree {

14:         local ($nodename) = @_;

15:         local ($leftchildname, $rightchildname);

16:

17:         $leftchildname = $nodename . "left";

18:         $rightchildname = $nodename . "right";

19:         if ($tree{$leftchildname} ne "") {

20:                 &print_tree($tree{$leftchildname});

21:         }

22:         print ("$nodename\n");

23:         if ($tree{$rightchildname} ne "") {

24:                 &print_tree($tree{$rightchildname});

25:         }

26: }



$ program10_5

grandchild1

child1

grandchild2

parent

grandchild3

child2

grandchild4

$



This program creates the tree depicted in Figure 10.11.
Figure 10.11: The tree created by Listing 10.5.
The associative array %tree stores the tree, and the scalar variable $rootname holds the name of the root of the tree. (Note that the grandchild nodes, such as grandchild1, are leaf nodes. There is no need to explicitly create grandchild1left, grandchild1right, and so on because the value of any undefined associative array element is, by default, the null string.)
After the tree has been created, the program calls the subroutine print_tree to traverse it and print its values. print_tree does this as follows:
  1. Line 17 appends left to the name of the node being examined to produce the name of the left child, which is stored in $leftchildname. For example, if the root node, parent, is being examined, the value stored in $leftchildname is parentleft.
  2. Similarly, line 18 appends right to the node name and stores the result in $rightchildname.
  3. Line 19 checks whether the current node has a left child, which is true if $tree{$leftchildname} is defined. (For example, parent has a left child, because $tree{"parentleft"} is defined.) If the current node has a left child, line 20 recursively calls print_tree to print the left subtree (the left child and its children).
  4. Line 22 prints the name of the current node.
  5. Line 23 checks whether the current node has a right child. If it does, line 24 recursively calls print_tree to print the right subtree.
Note that print_tree prints the names of the nodes of the tree in the following order: left subtree, node, right subtree. This order of traversal is called infix mode or infix traversal. If you move line 22 to precede line 19, the node is printed first, followed by the left subtree and the right subtree; this order of traversal is called prefix mode. If you move line 22 to follow line 25, the node is printed after the subtrees are printed; this is called postfix mode.

Databases

As you have seen, you can build a tree using an associative array. To do this, you build the associative array subscripts by joining character strings together (such as joining the node name and "left"). You can use this technique of joining strings together to use associative arrays to build other data structures.
For example, suppose you want to create a database that contains the lifetime records of baseball players. Each record is to consist of the following:
  • For non-pitchers, a record consists of games played (GP), home runs (HR), runs batted in (RBI) and batting average (AVG). For example, the record on Lou Gehrig would read as follows:
    Gehrig: 2164 GP, 493 HR, 1991 RBI, .340 BA
  • For pitchers, a record consists of games pitched (GP), wins (W), and earned run average (ERA). For example, the record on Lefty Grove would read as follows:
    Grove: 616 GP, 300 W, 3.05 ERA
To create a database containing player and pitcher records, you need the following fields:
  • A name field, for the player's name
  • A key indicating whether the player was a pitcher
  • The fields defined above
You can use an associative array to simulate this in Perl. To do this, build the subscripts for the associative array by concatenating the name of the player with the name of the field being stored by this element of the array. For example, if the associative array is named %playerbase, $playerbase{"GehrigRBI"}, it contains the career RBI total for Lou Gehrig.
Listing 10.6 shows how to build a player database and how to sequentially print fields from each of the player records.


Listing 10.6. A program that builds and prints a database.
1:  #!/usr/local/bin/perl

2:

3:  @pitcherfields = ("NAME", "KEY", "GP", "W", "ERA");

4:  @playerfields = ("NAME", "KEY", "GP", "HR", "RBI", "BA");

5:

6:  # Build the player database by reading from standard input.

7:  # %playerbase contains the database, @playerlist the list of

8:  # players (for later sequential access).

9:  $playercount = 0;

10: while ($input = <STDIN>) {

11:         $input =~ s/^\s+|\s+$//g;

12:         @words = split (/\s+/, $input);

13:         $playerlist[$playercount++] = $words[0];

14:         if ($words[1] eq "player") {

15:                 @fields = @playerfields;

16:         } else {

17:                 @fields = @pitcherfields;

18:         }

19:         for ($count = 1; $count <= @words; $count++) {

20:                 $playerbase{$words[0].$fields[$count-1]} =

21:                         $words[$count-1];

22:         }

23: }

24:

25: # now, print out pitcher win totals and player home run totals

26: foreach $player (@playerlist) {

27:         print ("$player: ");

28:         if ($playerbase{$player."KEY"} eq "player") {

29:                 $value = $playerbase{$player."HR"};

30:                 print ("$value home runs\n");

31:         } else {

32:                 $value = $playerbase{$player."W"};

33:                 print ("$value wins\n");

34:         }

35: }



$ program10_6

Gehrig        player      2164    493    1991   .340

Ruth          player      2503    714    2217   .342

Grove         pitcher     616     300    3.05

Williams      player      2292    521    1839   .344

Koufax        pitcher     397     165    2.76

^D

Gehrig: 493 home runs

Ruth: 714 home runs

Grove: 300 wins

Williams: 521 home runs

Koufax: 165 wins

$


This program has been designed so that it is easy to add new fields to the database. With this in mind, lines 3 and 4 define the fields that are to be used when building the player and pitcher records.
Lines 9-23 build the database. First, line 9 initializes $playercount to 0; this global variable keeps track of the number of players in the database.
Lines 10-12 read a line from the standard input file, check whether the file is empty, remove leading and trailing white space from the line, and split the line into words.
Line 13 adds the player name (the first word in the input line) to the list of player names stored in @playerlist. The counter $playercount then has 1 added to it; this reflects the new total number of players stored in the database.
Lines 14-18 determine whether the new player is a pitcher or not. If the player is a pitcher, the names of the fields to be stored in this player record are to be taken from @pitcherfields; otherwise, the names are to be taken from @playerfields. To simplify processing later on, another array variable, @fields, is used to store the list of fields actually being used for this player.
Lines 19-22 copy the fields into the associative array, one at a time. Each array subscript is made up of two parts: the name of the player and the name of the field being stored. For example, Sandy Koufax's pitching wins are stored in the array element KoufaxW. Note that neither the player name nor the field names appear in this loop; this means that you can add new fields to the list of fields without having to change this code.
Lines 26-35 now search the database for all the win and home run totals just read in. Each iteration of the foreach loop assigns a different player name to the local variable $player. Line 28 examines the contents of the array element named $player."KEY" to determine whether the player is a pitcher.
If the player is not a pitcher, lines 29-30 print out the player's home-run total by accessing the array element $player."HR". If the player is a pitcher, the pitcher's win total is printed out by lines 32-33; these lines access the array element $player."W".
Note that the database can be accessed randomly as well as sequentially. To retrieve, for example, Babe Ruth's lifetime batting average, you would access the array element $playerbase{"RuthAVG"}. If the record for a particular player is not stored in the database, attempting to access it will return the null string. For example, the following assigns the null string to $cobbavg because Ty Cobb is not in the player database:
$cobbavg = $playerbase{"CobbAVG"};

As you can see, associative arrays enable you to define databases with variable record lengths, accessible either sequentially or randomly. This gives you all the flexibility you need to use Perl as a database language.

Example: A Calculator Program

Listing 10.7 provides an example of what you can do with associative arrays and recursive subroutines. This program reads in an arithmetic expression, possibly spread over several lines, and builds a tree from it. The program then evaluates the tree and prints the result. The operators supported are +, -, *, /, and parentheses (to force precedence).
This program is longer and more complicated than the programs you have seen so far, but stick with it. Once you understand this program, you will know enough to be able to write an entire compiler in Perl!


Listing 10.7. A calculator program that uses trees.
1:  #!/usr/local/bin/perl

2:  # statements which initialize the program

3:  $nextnodenum = 1;  # initialize node name generator

4:  &get_next_item;    # read first value from file

5:  $treeroot = &build_expr;

6:  $result = &get_result ($treeroot);

7:  print ("the result is $result\n");

8:  # Build an expression.

9:  sub build_expr {

10:         local ($currnode, $leftchild, $rightchild);

11:         local ($operator);

12:         $leftchild = &build_add_operand;

13:         if (&is_next_item("+") || &is_next_item("-")) {

14:                 $operator = &get_next_item;

15:                 $rightchild = &build_expr;

16:                 $currnode = &get_new_node ($operator,

17:                         $leftchild, $rightchild);

18:         } else {

19:                 $currnode = $leftchild;

20:         }

21: }

22: # Build an operand for a + or - operator.

23: sub build_add_operand {

24:         local ($currnode, $leftchild, $rightchild);

25:         local ($operator);

26:         $leftchild = &build_mult_operand;

27:         if (&is_next_item("*") || &is_next_item("/")) {

28:                 $operator = &get_next_item;

29:                 $rightchild = &build_add_operand;

30:                 $currnode = &get_new_node ($operator,

31:                         $leftchild, $rightchild);

32:        } else {

33:                 $currnode = $leftchild;

34:        }

35: }

36: # Build an operand for the * or / operator.

37: sub build_mult_operand {

38:         local ($currnode);

39:         if (&is_next_item("(")) {

40:                 # handle parentheses

41:                &get_next_item;  # get rid of "("

42:                 $currnode = &build_expr;

43:                 if (! &is_next_item(")")) {

44:                         die ("Invalid expression");

45:                 }

46:                 &get_next_item;  # get rid of ")"

47:         } else {

48:                 $currnode = &get_new_node(&get_next_item,

49:                             "", "");

50:        }

51:        $currnode;  # ensure correct return value

52: }

53: # Check whether the last item read matches

54: # a particular operator.

55: sub is_next_item {

56:         local ($expected) = @_;

57:         $curritem eq $expected;

58: }

59: # Return the last item read; read another item.

60: sub get_next_item {

61:         local ($retitem);

62:         $retitem = $curritem;

63:         $curritem = &read_item;

64:         $retitem;

65: }

66: # This routine actually handles reading from the standard

67: # input file.

68: sub read_item {

69:         local ($line);

70:         if ($curritem eq "EOF") {

71:                 # we are already at end of file; do nothing

72:                 return;

73:         }

74:         while ($wordsread == @words) {

75:                 $line = <STDIN>;

76:                 if ($line eq "") {

77:                         $curritem = "EOF";

78:                         return;

79:                 }

80:                 $line =~ s/\(/ ( /g;

81:                 $line =~ s/\)/ ) /g;

82:                 $line =~ s/^\s+|\s+$//g;

83:                 @words = split(/\s+/, $line);

84:                 $wordsread = 0;

85:         }

86:         $curritem = $words[$wordsread++];

87: }

88: # Create a tree node.

89: sub get_new_node {

90:         local ($value, $leftchild, $rightchild) = @_;

91:         local ($nodenum);

92:         $nodenum = $nextnodenum++;

93:         $tree{$nodenum} = $value;

94:         $tree{$nodenum . "left"} = $leftchild;

95:         $tree{$nodenum . "right"} = $rightchild;

96:         $nodenum;   # return value

97: }

98: # Calculate the result.

99: sub get_result {

100:         local ($node) = @_;

101:         local ($nodevalue, $result);

102:        $nodevalue = $tree{$node};

103:        if ($nodevalue eq "") {

104:                die ("Bad tree");

105:        } elsif ($nodevalue eq "+") {

106:                $result = &get_result($tree{$node . "left"}) +

107:                          &get_result($tree{$node . "right"});

108:        } elsif ($nodevalue eq "-") {

109:                $result = &get_result($tree{$node . "left"}) -

110:                          &get_result($tree{$node . "right"});

111:        } elsif ($nodevalue eq "*") {

112:                $result = &get_result($tree{$node . "left"}) *

113:                          &get_result($tree{$node . "right"});

114:        } elsif ($nodevalue eq "/") {

115:                $result = &get_result($tree{$node . "left"}) /

116:                          &get_result($tree{$node . "right"});

117:        } elsif ($nodevalue =~ /^[0-9]+$/) {

118:                $result = $nodevalue;

119:        } else {

120:                die ("Bad tree");

121:        }

122:}



$ program10_7

11 + 5 *

(4 - 3)

^D

the result is 16

$



This program is divided into two main parts: a part that reads the input and produces a tree, and a part that calculates the result by traversing the tree.
The subroutines build_expr, build_add_operand, and build_mult_operand build the tree. To see how they do this, first look at Figure 10.12 to see what the tree for the example, 11 + 5 * (4 - 3), should look like.
Figure 10.12: The tree for the example in Listing 10.7.
When this tree is evaluated, the nodes are searched in postfix order. First, the left subtree of the root is evaluated, then the right subtree, and finally the operation at the root.
The rules followed by the three subroutines are spelled out in the following description:
  1. An expression consists of one of the following:
    1. An add_operand
    2. An add_operand, a + or - operator, and an expression

  1. An add_operand consists of one of the following:
    1. A mult_operand
    2. A mult_operand, a * or / operator, and an add_operand

  1. A mult_operand consists of one of the following:
    1. A number (a group of digits)
    2. An expression enclosed in parentheses
The subroutine build_expr handles all occurrences of condition 1; it is called (possibly recursively) whenever an expression is processed. Condition 1a covers the case in which the expression contains no + or - operators (unless they are enclosed in parentheses). Condition 1b handles expressions that contain one or more + or - operators.
The subroutine build_add_operand handles condition 2; it is called whenever an add_operand is processed. Condition 2a covers the case in which the add operand contains no * or / operators (except possibly in parentheses). Condition 2b handles add operands that contain one or more * or / operators.
The subroutine build_mult_operand handles condition 3 and is called whenever a mult_operand is processed. Condition 3a handles multiplication operands that consist of a number. Condition 3b handles multiplication operators that consist of an expression in parentheses; to obtain the subtree for this expression, build_mult_operand calls build_expr recursively and then treats the returned subtree as a child of the node currently being built.
Note that the tree built by build_expr, build_mult_operand, and build_add_operand is slightly different from the tree you saw in Listing 10.5. In that tree, the value of the node could also be used as the subscript into the associative array. In this tree, the value of the node might not be unique. To get around this problem, a separate counter creates numbers for each node, which are used when building the subscripts. For each node numbered n (where n is an integer), the following are true:
  • $tree{n} contains the value of the node, which is the number or operator associated with the node.
  • $tree{n."left"} contains the number of the left child of the node.
  • $tree{n."right"} contains the number of the right child of the node.
The subroutines is_next_item, get_next_item, and read_item read the input from the standard input file and break it into numbers and operators. The subroutine get_next_item "pre-reads" the next item and stores it in the global variable $curritem; this lets is_next_item check whether the next item to be read matches a particular operator. To read an item, get_next_item calls read_item, which reads lines of input, breaks them into words, and returns the next word to be read.
The subroutine get_new_node creates a tree node. To do this, it uses the contents of the global variable $nextnodenum to build the associative array subscripts associated with the node. $nextnodenum always contains a positive integer n, which means that the value associated with this node (which is a number or operator) is stored in $tree{n}. The location of the left and right children, if any, are stored in $tree{n."left"} and $tree {n."right"}.
The subroutine get_result traverses the tree built by build_expr in postfix order (subtrees first), performing the arithmetic operations as it does so. get_result returns the final result, which is then printed.
Note that the main part of this program is only eight lines long! This often happens in more complex programs. The main part of the program just calls subroutines, and the subroutines do the actual work.


NOTE
This program is just the tip of the iceberg: you can use associative arrays to simulate any data structure in any programming language.

Summary

In today's lesson, you learned about associative arrays, which are arrays whose subscripts can be any scalar value.
You can copy a list to an associative array, provided there is an even number of elements in the list. Each pair of elements in the list is treated as an associated array subscript and its value. You also can copy an associative array to an ordinary array variable.
To add an element to an associative array, just assign a value to an element whose subscript has not been previously seen. To delete an element, call the built-in function delete.
The following three built-in functions enable you to use associative arrays with foreach loops:
  • The built-in function keys retrieves each associative array subscript in turn.
  • The built-in function values retrieves each associative array value in turn.
  • The built-in function each retrieves each subscript-value pair in turn (as a two-element list).
Associative arrays are not guaranteed to be stored in any particular order. To guarantee a particular order, use sort with keys, values, or each.
Associative arrays can be used to simulate a wide variety of data structures, including linked lists, structures, trees, and databases.

Q&A

Q:Are pointers implemented in Perl?
A:Yes, if you are using Perl 5; they are discussed on Day 18, "References in Perl 5." Perl 4 does not support pointers.
Q:How can I implement more complicated data structures using associative arrays?
A:All you need to do is design the structure you want to implement, name each of the fields in the structure, and use the name-concatenation trick to build your associative array subscript names.
Q:What do I do if I want to build a tree that has multiple values at each node?
A:There are many ways to do this. One way is to append value1, value2, and so on, to the name of each node; for example, if the node is named n7, n7value1 could be the associative array subscript for the first value associated with the node, n7value2 could be the subscript for the second, and so on.
Q:What do I do if I want to build a tree that has more than two children per node?
A:Again, there are many ways to do this. A possible solution is to use child1, child2, child3, and so on, instead of left and right.
Q:How do I destroy a tree that I have created?
A:To destroy a tree, write a subroutine that traverses the tree in postfix order (subtrees first). Destroy each subtree (by recursively calling your subroutine), and then destroy the node you are looking at by calling delete.
Note that you shouldn't use keys or each to access each element of the loop before deleting it. Deleting an element affects how the associative array is stored, which means that keys and each might not behave the way you want them to.
If you want to destroy the entire associative array in which the tree is stored, you can use the undef function, which is described on Day 14, "Scalar-Conversion and List-Manipulation Functions."

Workshop

The Workshop provides quiz questions to help you solidify your understanding of the material covered and exercises to give you experience in using what you've learned. Try and understand the quiz and exercise answers before you go on to tomorrow's lesson.

Quiz

  1. Define the following terms:
    1. associative array
    2. pointer
    3. linked list
    4. binary tree
    5. node
    6. child

  1. What are the elements of the associative array created by the following assignment?
    %list = ("17.2", "hello", "there", "46", "e+6", "88");
  2. What happens when you assign an associative array to an ordinary array variable?
  3. How can you create a linked list using an associative array?
  4. How many times does the following loop iterate?
    %list = ("first", "1", "second", "2", "third", "3");
    foreach $word (keys(%list)) {
    last if ($word == "second");
    }

Exercises

  1. Write a program that reads lines of input consisting of two words per line, such as
    bananas 16
    and creates an associative array from these lines. (The first word is to be the subscript and the second word the value.)
  2. Write a program that reads a file and searches for lines of the form
    index word
    where word is a word to be indexed. Each indexed word is to be stored in an associative array, along with the line number on which it first occurs. (Subsequent occurrences can be ignored.) Print the resulting index.
  3. Modify the program created in Exercise 2 to store every occurrence of each index line. (Hint: Try building the associative array subscripts using the indexed word, a non-printable character, and a number.) Print the resulting index.
  4. Write a program that reads input lines consisting of a student name and five numbers representing the student's marks in English, history, mathematics, science, and geography, as follows:
    Jones 61 67 75 80 72
    Use an associative array to store these numbers in a database, and then print out the names of all students with failing grades (less than 50) along with the subjects they failed.
  5. BUG BUSTER: What is wrong with the following code?
    %list = ("Fred", 61, "John", 72,
    "Jack", 59, "Mary", 80);
    $surname = "Smith";
    foreach $firstname (keys (%list)) {
    %list{$firstname." ".$surname} = %list{$firstname};
    }