Perl file manipulation continued and introducing regular expressions (regex).

The last and possibly most useful thing I wish to show you is regular expressions (henceforth referred to as regex). I'll also call it "Perl string manipulation" for the benefit of the search engines spidering this page. :-)

Regex is basically "pattern matching": the ability to look through vast tomes of data, find patterns, and do stuff with the data that matches.

There are two main forms of regex, "match" and "substitute", and they look like this:

   # Just match on a pattern.
m/pattern/;

and

   # Match on pattern, and replace any matches found with 'replacement'
s/pattern/replacement/;

(Strangely enough, the "m" stands for "match" and the "s" for "substitute". :-)
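To make the two forms concrete, here is a small sketch. One thing the bare forms gloss over: m/pattern/ on its own tests Perl's default variable $_, so to test a variable of your own you bind it to the pattern with the =~ operator. The variable names ($text, $found, $changed) are made up purely for illustration.

```perl
use strict;
use warnings;

my $text = 'The quick brown fox';

# Match: true if the pattern appears anywhere in $text.
my $found = ($text =~ m/quick/) ? 1 : 0;

# Substitute: copy $text into $changed, then replace the
# first "brown" in the copy with "red". $text is left alone.
(my $changed = $text) =~ s/brown/red/;

print "found: $found\n";        # found: 1
print "changed: $changed\n";    # changed: The quick red fox
```

The copy-then-substitute trick on the $changed line is handy whenever you want to keep the original string around.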

A real world example.
Here you are going to see not only the use of file "open" but also an example of a "foreach" loop, an "if" statement, and some regex to boot, all in one little code sample. (Told you the boring stuff was useful.)

Say you opened the file as detailed in the previous primer and you wanted to look through every line of the file's data for some sort of pattern. Here is what you would do:

   # Set the file path and name.
my $data_file = '/var/www/cgi-bin/mydata/data.txt';

   # Open the file for reading.
open my $fh, '<', $data_file or die "can't open $data_file: $!";
my @array_of_data = <$fh>;
close $fh;

Now @array_of_data contains all the text in that file, and you can manipulate it as you please.

So $array_of_data[0] would be the first line of text in the file, $array_of_data[1] would be the second line, and so on.

NOTE: If the file in question is really big, it's not a good idea to pull it all into an array. Say, for example, you did this with a server logfile 50,000 lines long. The chances are good that you would either time out the CGI process, or chew up the server's available memory, or both, none of which is a good idea. :-) In cases like this, it's better to open the file and loop over the filehandle, which reads one line at a time.
For example:

my $data_file = '/var/www/cgi-bin/mydata/data.txt';

open my $fh, '<', $data_file or die "can't open $data_file: $!";
while (<$fh>)
      {
   # any action here will be applied to each line of the file (held in $_).
      }
close $fh;

There is also something called "slurping", which tells Perl to grab the whole contents of a file in one hit. That lets you put the entire content of a file into a single scalar variable. As with the array, though, this shouldn't be done on big files as it will chew a whole heap of memory.
I've also not discussed file locking, which is an important consideration for scripts that may be accessed by more than one person simultaneously.
File locking is discussed reasonably well here: About.com Perl file locking.
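Coming back to slurping for a moment, here is a minimal sketch of one common way to do it: temporarily undefine the input record separator $/ so that a single read returns the whole file. The throwaway file is created with the core File::Temp module only so the example runs on its own; the variable names are assumptions for the demo.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small throwaway file so the example is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "line one\nline two\n";
close $tmp_fh or die "can't close $tmp_name: $!";

# Slurp: with $/ undefined, one read returns the whole file as one string.
my $content = do {
    local $/;    # undefine the record separator inside this block only
    open my $fh, '<', $tmp_name or die "can't open $tmp_name: $!";
    <$fh>;
};

print length($content), " characters read\n";    # 18 characters read
```

Using "local" means $/ goes back to its normal value as soon as the block ends, so line-by-line reads elsewhere in the script are not affected.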

Getting back to the initial example, where the content is pulled into an array: if we wanted to search that data for the word "dangerous", we would use a foreach loop and some regex to do it, like so:

# Start the foreach loop, and assign each line,
# one at a time, to the variable $line.
foreach my $line (@array_of_data)
{
    # Start an if statement, the condition of which is
    # "If this particular line contains the word dangerous."
    if ($line =~ m/dangerous/i)
    {
        # If the line contains "dangerous" then print the line out.
        print "This line contains the word dangerous: $line\n";
    } # End the if condition here.
} # End the foreach loop here.

It reads just like English, and it says: open the file and make each line of the file an element in the array. Then loop through the array, picking out each item (a line from the file) one at a time and assigning it to $line. Then check whether $line contains a match for the word "dangerous", and if it does, print out that line. Then move on to the next line of the file, again assign it to $line, and check it for "dangerous" as well. Keep doing that till you reach the last line of the file, then exit the loop.

Not too hard, is it?

Now let's do something even more useful: let's manipulate the string contained in $line so that all instances of the word "dangerous" are replaced with the word "safe".
By doing that in the foreach loop, we can do that for the contents of the entire file
(since it's stored line for line in the array).

We do this in much the same way as the previous example, except we don't need the "if" statement anymore.

foreach my $line (@array_of_data)
{
  # Use substitute regex to replace "dangerous"
  # with the word "safe"
 $line =~ s/dangerous/safe/gi;
}

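Now you have that entire file's contents in the array, except that the word "dangerous" has been replaced with the word "safe". (This works because, inside a foreach loop, $line is an alias for the actual array element, so changing $line changes the array.) The actual work was all done with one short line of code: $line =~ s/dangerous/safe/gi;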

If you noticed the gi at the end of the pattern, you might be wondering what they are for. The "g" means global: don't stop at the first match, keep looking through the rest of the line in case there are more matches. (It is sometimes loosely described as making the pattern "greedy", but in regex terms greedy means something else: a quantifier such as * grabbing as much text as it can.) The "i" tells the pattern not to worry about character case, so it will match DANGEROUS, dangerous, DANgerous etc. The g isn't necessary in the first example, as we only need to know whether the word is in the line at least once, and we print the line if it is. The second example needs the g because "dangerous" might be on a line more than once, and we want to change all of them to "safe", so by adding the g, Perl will keep looking for more matches on that line.
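Here is a small sketch of what each flag changes, run against one made-up line that contains the word twice, once capitalised. The variable names are assumptions for the demo only.

```perl
use strict;
use warnings;

my $line = 'Dangerous roads are dangerous.';

# No flags: case-sensitive, and only the first match is replaced.
(my $first_only = $line) =~ s/dangerous/safe/;

# i only: case is ignored, but still only the first match is replaced.
(my $no_case = $line) =~ s/dangerous/safe/i;

# g and i together: case is ignored and every match is replaced.
(my $all = $line) =~ s/dangerous/safe/gi;

# s/// also returns the number of substitutions it made.
my $copy  = $line;
my $count = ($copy =~ s/dangerous/safe/gi);

print "$first_only\n";    # Dangerous roads are safe.
print "$no_case\n";       # safe roads are dangerous.
print "$all\n";           # safe roads are safe.
print "$count\n";         # 2
```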

So, we have opened the file, read it with a loop, and changed the word dangerous for the word safe. How about we now write the revised text back to the old file, replacing the old version? We open the file (for writing this time) and use another foreach loop to go through the array and write each line back to the file.

# Open the file for writing.
open my $out, '>', $data_file or die "can't open $data_file: $!";

# Start a foreach loop assigning
# each line to $line, in turn.
foreach my $line (@array_of_data)
   {
   # Print each line in turn to the new filehandle $out
   print {$out} $line;
   }
# Close the new file.
close $out or die "can't close $data_file: $!";

Now the file $data_file contains the same text it did before, except that every instance of "dangerous" is now "safe".

Cool, huh?
(I could have written that entire script in less than half the size by combining the various sections together, but that would have been less instructive, so I did it the long way.)
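For the curious, here is roughly what a combined version could look like. Treat it as a sketch rather than the definitive short form: the sub name make_safe and the throwaway temp file (created with the core File::Temp module) are assumptions made up for this demo.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: replace every "dangerous" with "safe" in one file.
# Returns the number of lines it wrote back.
sub make_safe {
    my ($file) = @_;

    # Read every line into an array.
    open my $in, '<', $file or die "can't open $file: $!";
    my @lines = <$in>;
    close $in;

    # $_ is an alias for each element, so this edits the array in place.
    s/dangerous/safe/gi for @lines;

    # Write the revised lines back over the original file.
    open my $out, '>', $file or die "can't open $file: $!";
    print {$out} @lines;
    close $out or die "can't close $file: $!";

    return scalar @lines;
}

# Demo on a throwaway file so nothing real gets overwritten.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "This road is dangerous.\nVery DANGEROUS indeed.\n";
close $tmp_fh or die "can't close $tmp_name: $!";

my $lines_written = make_safe($tmp_name);
print "rewrote $lines_written lines\n";    # rewrote 2 lines
```

Since this still pulls the whole file into an array, the earlier warning about really big files applies here too.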

You might be wondering what the use of this stuff is. Here are a few examples, keeping HTML in mind:

- Opening an HTML file and stripping out all the HTML tags, so you end up with just the text of a web page. (You can use a substitute regex that tells Perl to replace the tags < > and anything in between them with nothing, i.e. remove them.)

- You want to change the background color in 3 dozen web pages (or 10000). You would use opendir (open directory) instead of open to find all files ending in .htm or .html, then open each one and replace bgcolor='white' with bgcolor='black', or whatever your preferences are, in all of them. See? This is the reason that Unix administrators love Perl: they can do stuff in seconds that would take hours to do by hand. (Perl could do hundreds of these in just seconds.)

- You had a system crash, and all your directories are full of *.chk files (created by scandisk when it tries to save any potential data). You want to search your entire hard drive for these .chk files, looking for a specific bit of data you lost inside one of them, and delete each file if it doesn't contain what you wanted. (Here you would use "opendir" to go through each directory, "open" to open each file, pattern matching regex to look for the data you are missing, and "unlink" to delete the file if it doesn't have what you want in it.)
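The first two ideas above can be sketched in a few lines. Both helper names (strip_tags and html_files_in) are hypothetical, and the tag-stripping regex is deliberately naive: it simply deletes anything between a < and the next >, so it will trip over comments, scripts, or attribute values that contain a > character.

```perl
use strict;
use warnings;

# Hypothetical helper: remove anything that looks like <...> from a string.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;    # a "<", any run of non-">" characters, then ">"
    return $html;
}

# Hypothetical helper: list the .htm and .html files in one directory.
sub html_files_in {
    my ($dir) = @_;

    opendir my $dh, $dir or die "can't open directory $dir: $!";
    my @files = grep { /\.html?$/i } readdir $dh;
    closedir $dh;

    @files = sort @files;
    return @files;
}

print strip_tags('<p>Hello <b>world</b></p>'), "\n";    # Hello world
```

To do the background color change, you could loop over the names returned by html_files_in, read each file, run the same kind of s///g substitution shown earlier on every line, and write the result back.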

The scripts to do any of those examples would probably be less than 20 lines long as it is, but if you were to use modules (which I will show you by example in the next primers), most of the hard work is done for you, and the scripts would all probably be halved in length again. (Using modules, you could write a script to do any of the above in roughly 10 lines or less.)

OK, now you know the very basics. Regex can get 1000 times more complicated than what I have shown you thus far, but for simple stuff it's not too bad, and most of the time Perl string manipulation involves fairly simple usage like the examples above. There are a ton of resources on the internet for those looking for more information on Perl string manipulation and regex.
Useful links for regex info:
gatech.edu regex tute.
troubleshooters.com regex tute.
perlarchive.com regex tute.

We've covered most of the basics now, so all tutes after this will cover more advanced stuff and, hopefully, some usable examples.
That will be the point where everything I have shown you thus far starts to make sense (assuming it doesn't already).

Again, keep in mind that we have only touched on the basics thus far; things can get much, much more complicated if you want them to.
But the great thing about Perl is that you now know enough to write usable scripts (as I soon hope to show you).
You can now learn new stuff only when you find something you can't do with what you know.
(I should tell you, however, that once you grasp the fundamentals, it gets addictive and you want MORE!!!! :-) But that's a good thing.)

<Cool Tip>
Modules are the single biggest reason to use Perl for a particular task. There are hundreds of modules included as standard with any Perl installation, and literally thousands more you can download and use (for no cost) if the urge takes you. (See http://search.cpan.org for a search engine that finds modules for any particular purpose.)

Here are just a tiny number of the things modules can do for you:
- Interacting very easily with databases: Access, MySQL, PostgreSQL, Oracle and more.
- Create, resize, crop and modify images (jpg, gif, png etc.).
- Write an entire web page template in one line of Perl code.
- Set and retrieve cookies, or any other type of header.
- Upload files to your server using just your browser and a Perl script.
- Run other programs on your web server, and display their results in a browser.
- Send email in a dozen different ways, attach files to them, make them pretty with images and html.

That's just a tiny sample of the more common ones; if you want more, I could keep listing stuff you can do for hours. Modules are really just snippets of prewritten code (a little bit like subroutines) that you can call and use in any way you like; the vast majority of the work is already done for you. Some modules run into hundreds of lines of code, all prewritten and ready for you to use at the drop of a hat (and quite easily as well). And since most of the most useful ones are already on your Perl web server, all it takes is one line of code to call them. In fact, all the full examples I have shown you thus far have "use"d modules. (That's a pun: the lines that start with "use" were calling a module of some sort.)

The best bit is you don't have to understand a module to use it. I use heaps of modules whose code I have never even looked at. Many of them use standard object-oriented programming to do their magic.

Modules don't even have to be written in Perl. Some modules have C code in them to get that little extra speed (but you don't need to know C to use them).
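As a tiny illustration of the "one line of code to call them" idea, here is a sketch using two modules that ship with Perl itself, File::Basename and List::Util. These two happen to hand you plain functions rather than objects. The path and the list of numbers are made-up sample data.

```perl
use strict;
use warnings;
use File::Basename qw(basename dirname);    # core module, ships with Perl
use List::Util qw(sum max);                 # core module, ships with Perl

# Split a path into its file name and directory parts.
my $path = '/var/www/cgi-bin/mydata/data.txt';
my $name = basename($path);
my $dir  = dirname($path);

# Add up a list of numbers and find the largest one.
my @sizes   = (120, 45, 300);
my $total   = sum(@sizes);
my $biggest = max(@sizes);

print "$name lives in $dir\n";              # data.txt lives in /var/www/cgi-bin/mydata
print "total $total, biggest $biggest\n";   # total 465, biggest 300
```

Neither job is hard to write by hand, but with the "use" line in place you get a version that someone else has already written and tested.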

Since Perl is an "open source" language, there are hundreds of thousands of developers using it all the time, and the vast majority of them contribute back to the community in some way (many by writing modules), so new features are being added all the time. So "where will you want to go tomorrow" would be a good slogan for Perl.
