Where GREP Came From - Computerphile

May 01, 2020

I thought that today we would maybe talk about 'grep', a well-known command in the UNIX world, something that has been around since the early 1970s. What 'grep' allows you to do is look for text patterns, arbitrary text patterns, in one or more files, and there could be an unlimited number of input files. Or the input could come from some other program, for example if you are using UNIX pipes. So you take some program and pipe it into 'grep', and that way, no matter the amount of input, 'grep' can filter it or show you the things you're interested in.

And that's something that is not very convenient to do with a text editor, in any case. One of the questions about 'grep' has always been: where does that strange name come from? So I thought maybe I could tell that story, if it was of any interest, and we'll see where we go from there. The way it came about: you have to go back to the early days of computing, before everyone in this room was born except me. Let's say something like 1970-71: the early days of UNIX. The computer that UNIX was running on was a PDP-11. At the time it was probably an 11/20.

It was a machine that had very, very little computing power. It didn't run very fast. It didn't have much memory either: probably something on the order of 32K, maybe 64K bytes, and that's kilobytes, not megabytes. It also had very small secondary storage, you know, a few megabytes of disk and things like that. So computing resources were very, very limited, and that meant that a lot of the software that existed in the early days of UNIX tended to be fairly simple and straightforward. And that reflected not only the relative "weakness" of the hardware but also the personal tastes of the people doing the work, primarily Ken Thompson and Dennis Ritchie.

So one of the accessories... one of the standard programs that people use on any system is the text editor. The UNIX text editor was called 'ed', and it is not pronounced 'edd'. At least by those in the know, it's pronounced 'ee dee'. It was written by Ken Thompson, and I think it was basically a stripped-down version of an editor called QED, which Ken had worked with and on long before. So it was a very small, simple and straightforward editor, and what you have to remember is that, in those days, there were also no real video display terminals, not the kind we are used to today, or even the kind from 10 or 20 years ago.

But in fact, all the computing, all the editing, etc., was done on paper. Do you remember the roll? If you zoom down here you can see the paper! This meant that many things were designed to minimize the use of paper. It also meant that editors worked one line at a time, or on several lines at once, but there was no cursor addressing, so you couldn't move around within a line. And so the 'ed' text editor reflected that kind of thing. Maybe what I should do is just take a quick look at what 'ed' was like? The commands for 'ed' were single-letter commands.

So, for example, there was a command called 'p', which meant 'print'. There was a command called 'd', which deleted a line. There was a command called 's', which stood for... which meant 'substitute', so you could change, you know, 'ABC' to 'DEF', or something like that. There was an 'append' command that just said 'add more text', and you could add a bunch of lines and then finish with something. Of course, there was a 'read' command to be able to read information from a file, and there was a 'write' command to be able to put it back into a file. And a bunch of other things like that.

That was the essence of what it did. One of the things that 'ed' did very well was this: OK, these commands apply by default to the current line, but what do you do when you want to be more specific about which lines you are operating on? Then you could say things like 'print line 1 to line 10'. So '1,10p' would print the first 10 lines. But let's say you wanted to print all the lines in the file. There was a shorthand, '$', for the last line. So you could say '1,$p' and that would print all the lines in the file.

Or you might say, "Wow! I wonder... I just want to see the last line." So I could say '$p' and that would give me that. You could even leave out the 'p', but that's enough of that. Or you could delete the last line by saying '$d'. Or you could delete the first line by saying '1d'. That's the line addressing. So far it's not very complicated. What 'ed' added to all of that, and this is definitely Ken's influence, was the idea of regular expressions. A regular expression is a pattern of text; it is a way of specifying patterns of text.
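The line addressing described above can be sketched in a few lines of Python. This is only an illustration, not how 'ed' itself was written: the function names `resolve_address` and `run_command` are made up for this example, and only the 'p' and 'd' commands with numeric or '$' addresses are handled.

```python
def resolve_address(addr: str, last: int, current: int) -> int:
    """Turn one ed-style address into a 1-based line number."""
    if addr == "$":          # '$' is shorthand for the last line
        return last
    if addr == "":           # no address: default to the current line
        return current
    return int(addr)         # otherwise a plain line number


def run_command(buffer: list[str], command: str, current: int = 1) -> list[str]:
    """Apply 'p' (print) or 'd' (delete) to a list of lines.

    Returns the lines that would be printed; 'd' changes buffer in place.
    """
    op = command[-1]
    if op not in ("p", "d"):
        raise ValueError(f"unsupported command: {command!r}")
    first_text, comma, second_text = command[:-1].partition(",")
    last = len(buffer)
    first = resolve_address(first_text, last, current)
    # A single address such as '$p' means a one-line range.
    second = resolve_address(second_text, last, current) if comma else first
    if not 1 <= first <= second <= last:
        raise ValueError(f"address out of range: {command!r}")
    if op == "p":
        return buffer[first - 1:second]
    del buffer[first - 1:second]
    return []


lines = ["alpha", "beta", "gamma"]
all_lines = run_command(lines, "1,$p")   # every line in the buffer
last_line = run_command(lines, "$p")     # just the last line
run_command(lines, "1d")                 # delete the first line
```

Here `1,$p` resolves to the range from line 1 to the last line, and `$p` and `1d` each resolve to a single line.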

They could be literal text like the word 'print', or they could be something more complicated, like things that start with 'Prin' and can continue as 'Print' or 'Princeton' or 'Princess', or whatever. That kind of thing. And the way you wrote regular expressions in the 'ed' text editor was to say '/' and then type the characters of the regular expression. So, you could say '/print/' and that would match the next line, after the one you were working on, that contained the word 'print' anywhere within it. So the regular expressions in the 'ed' editor were somewhat different from, a little more sophisticated and complicated than, the patterns you find in shell wildcards, where, for example, a star means 'anything'.
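The difference between a regular expression and a shell wildcard can be shown with Python's standard `re` and `fnmatch` modules. This is a rough sketch: Python's regular expression syntax is not identical to the one in 'ed', and the word list is invented for the example.

```python
import fnmatch
import re

words = ["print", "Princeton", "Princess", "sprint", "pin"]

# Regular expression: 'Prin' followed by any run of lowercase letters.
starts_with_prin = re.compile(r"Prin[a-z]*")
regex_hits = [w for w in words if starts_with_prin.fullmatch(w)]

# Shell wildcard: a lone '*' means "any run of characters".
glob_hits = [w for w in words if fnmatch.fnmatchcase(w, "Prin*")]

# The same text 'Prin*' means something else as a regular expression:
# the star applies to the 'n' before it ("zero or more n").
regex_star_matches_pri = re.fullmatch(r"Prin*", "Pri") is not None
glob_star_matches_pri = fnmatch.fnmatchcase("Pri", "Prin*")

# Finding a word anywhere within a line, as '/print/' does in 'ed'.
line_contains_print = re.search(r"print", "a sprint to the finish") is not None
```

Both `regex_hits` and `glob_hits` pick out 'Princeton' and 'Princess', but the two notations disagree about what `Prin*` means: as a regular expression it matches 'Pri', while as a wildcard it does not.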

So, same idea of text patterns, with a slightly different specification: a different way of writing patterns, but one suitable for text editing. And then you could say things like "I want to find the next occurrence of the word 'print' in my file." And there it would be. And so on. Well, that's the 'ed' text editor. We are still very far from 'grep' at the moment. So what is 'grep' all about? Well, it turns out that at the time this was happening, 'ed' was the standard text editor. But, like I said, the machines we were working on were very weak.

There wasn't much computing power, in many ways. And in fact, one of the limitations was that you couldn't edit a very large file, because there wasn't enough memory and 'ed' worked entirely within memory, so you were stuck. One of my colleagues at the time, Lee McMahon, was very interested in doing text analysis, the kind of thing we would perhaps call Natural Language Processing today. And so what Lee wanted to do... he had been studying what was, at the time, a very interesting question: who were the authors of some fundamental American documents called the Federalist Papers?

The Federalist Papers were written by James Madison, Alexander Hamilton, and John Jay in 1787 and '88, if I remember correctly. There were 85 of these documents, but they were published anonymously under the name Publius. And so we had no idea, in theory, who wrote them. And that's why there have been many scholars trying to find out for sure. It's well known who wrote some of them, and I think others are still a little unclear, so Lee was interested to see if you could really, through textual analysis of his own devising, find out who wrote these things. So, fine.

But it turns out that these 85 documents, in total just over a megabyte (I mean, down in the noise by today's standards), would not fit. He couldn't edit them all in 'ed'. And then what do you do? So one day he said, "I just want to go through and find all the occurrences of 'something' in the Federalist Papers so I can see them!" And he told this to Ken Thompson and then he went home to dinner or something. And he came back the next day and Ken had written the program, and the program was called 'grep'. And what 'grep' did was go through a bunch of documents (one or more files) and just find all the places where a particular regular expression appeared in those things.
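What 'grep' does can be sketched as a small Python generator. This is an illustration only, not Ken Thompson's program: the name `grep` is reused here for a hypothetical function, and Python's `re` syntax stands in for the original regular expressions. The point of the sketch is that lines are examined one at a time, so the whole input never has to fit in memory, unlike a file loaded into an editor.

```python
import re
from typing import Iterable, Iterator


def grep(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield each line in which the regular expression appears."""
    regex = re.compile(pattern)
    for line in lines:           # one line at a time; nothing is buffered
        if regex.search(line):   # a match anywhere within the line counts
            yield line


sample = [
    "the powers delegated to the federal government\n",
    "are few and defined\n",
    "those which remain in the State governments\n",
]
matches = list(grep(r"government", sample))
```

Because `grep` accepts any iterable of lines, the same function works on an open file or on `sys.stdin` when another program's output is piped in.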

And so... it turns out that one more of the commands in 'ed' is a command called 'g'. And this meant "global." And what it said was that, on every line that matches a particular regular expression, for example 'print', I can run an 'ed' command. So, I could say: "On every line that contains the word 'print', just print it." Then I can see all my various print statements. Or you could, in the same way, say 'g' (with some other regular expression in there) and 'd' to delete them. Then you could delete all the comments in a program, or something like that.
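The 'g' command can be sketched as a function that takes a command of the form `g/re/cmd` and applies it to a list of lines. This is a simplified illustration under stated assumptions: `global_command` is a made-up name, only 'p' and 'd' are supported, the regular expression may not itself contain a '/', and Python's `re` syntax is used in place of the one in 'ed'.

```python
import re


def global_command(buffer: list[str], command: str) -> list[str]:
    """Run an ed-style 'g/re/cmd' on buffer, where cmd is 'p' or 'd'.

    Returns the lines that would be printed; 'd' changes buffer in place.
    """
    parts = command.split("/")
    if len(parts) != 3 or parts[0] != "g" or parts[2] not in ("p", "d"):
        raise ValueError(f"unsupported command: {command!r}")
    regex = re.compile(parts[1])
    if parts[2] == "p":
        # Print every line on which the regular expression matches.
        return [line for line in buffer if regex.search(line)]
    # Delete every matching line, keeping the others in order.
    buffer[:] = [line for line in buffer if not regex.search(line)]
    return []


program = ["# set up", "x = 1", "print(x)", "# done", "print('bye')"]
printed = global_command(program, "g/print/p")   # show the print statements
global_command(program, "g/^#/d")                # delete the comment lines
```

With the 'p' command, the function does to an in-memory buffer what the generic form g/re/p describes: for every line matching the regular expression, print it.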

So the general structure of that is 'g' followed by a regular expression (between slashes), followed by the letter 'p': g/re/p. And that's the genesis of where the name came from. Well, this is in a way the genius of Ken Thompson. A beautiful program, written in a very short time, by taking some other program and just cutting it down, and then giving it a name that stuck. That's the story of where 'grep' came from. Let me add one thing: literally 25 years ago, it was the spring of 1993, I was teaching at Princeton as a visitor.

And I needed an assignment for my programming class. And I thought "Mmm!" So what I did was I told the students in the class, "OK, here's the source code for 'ed'." At that time it was probably 1800 lines of C. "Your job is to take these 1800 lines of C and convert them into 'grep' as a C program. OK, you have a week to do it." And I told them, at the time, that they had a couple of advantages. First, they knew what the goal was. Someone had already done 'grep', so they knew what it should look like.

And all they had to do was replicate that behavior. And the other thing is that it was now written in C. The original 'grep' was written in PDP-11 assembly language. And of course, they also had a serious disadvantage: none of them were Ken Thompson.
