Jacob Ruiz


Mastering Javascript Fundamentals: Lookaheads and backreferences

Get the fundamentals down and the level of everything you do will rise. - Michael Jordan

As stated in my original post, I do 1 hour of video lessons from Watch and Code every day. If you're interested in learning Javascript in a way that goes beyond basic tutorials and gives you a foundational, practical knowledge without relying on frameworks - I'd highly recommend it. If you're reading these posts, please keep in mind that these are just my notes, and I'm not an expert (yet!). If your goal is also to master the fundamentals of Javascript, please head over to Watch and Code and start your journey there!

All screenshots were annotated using Shotty.


Lookaheads and backreferences

Match "w" only if what comes after "w" is "w":

w(?=w)

This is the syntax for a "positive lookahead".

Here it is shown in RegExr:

The (?=w) piece won't be included in the result; it is only there to specify the condition.
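Here's a minimal sketch of the same idea in plain JavaScript. The test string "wwww" is just something I picked for illustration, not necessarily what the lesson used.

```javascript
// Sample input chosen for illustration (an assumption, not from the lesson).
const input = 'wwww';

// Match every "w" that is immediately followed by another "w".
// The lookahead (?=w) checks the next character without consuming it,
// so each match is a single "w".
const matches = input.match(/w(?=w)/g);

console.log(matches); // three one-character matches; the final "w" is skipped
```

Because the lookahead doesn't consume anything, the second "w" is still available to be matched on its own.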

We can run some more examples:

Match "w" only if it's followed by an "h".

Only match "w" if it's followed by "oot".
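To try those two examples outside of RegExr, something like this should work. The sentence is made up for the example, and I'm using matchAll to collect the position of each match.

```javascript
// Sample sentence made up for this example (an assumption).
const sentence = 'who will woot';

// "w" only when the next character is "h".
const beforeH = [...sentence.matchAll(/w(?=h)/g)].map((m) => m.index);

// "w" only when the next three characters are "oot".
const beforeOot = [...sentence.matchAll(/w(?=oot)/g)].map((m) => m.index);

console.log(beforeH);   // [0]  -> the "w" in "who"
console.log(beforeOot); // [9]  -> the "w" in "woot"
```

The "w" in "will" is skipped by both patterns because neither condition holds for it.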

Looking back at our original example: "match every w except the last one".

There's a pretty big problem with our current implementation: what if we have non-consecutive w's? Everything breaks.

The reason is that our regular expression isn't allowing for characters to be in between the w's. 
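A quick way to see the problem, assuming a test string where the w's are separated by other characters (my own example string):

```javascript
// The w's are no longer next to each other (illustrative input).
const spaced = 'w a w a w';

// (?=w) only looks at the very next character, which is always a space
// or the end of the string here, so nothing matches.
const result = spaced.match(/w(?=w)/g);

console.log(result); // null
```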

We need a way to say that between one w and the next, we can see any character, any number of times.

Well to say "any character", we can use the meta character: .

And to say "any number of times" we can use the quantifier {0,}, which means "zero or more times".

So we can write:

w(?=.{0,}w)

Here it is in RegExr:

There's actually a meta character for "zero or more". 

We know that one or more is a plus sign, +.

Zero or more is a star, *.

So this:

w(?=.{0,}w)

Is the same as this:

w(?=.*w)
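As a sanity check, here is a small sketch comparing the {0,} form with the star form on the same input. The input string is an assumption on my part.

```javascript
// Illustrative input with non-consecutive w's.
const text = 'w a w a w';

// Positions of every "w" that has another "w" somewhere after it,
// written with the explicit {0,} quantifier...
const withBraces = [...text.matchAll(/w(?=.{0,}w)/g)].map((m) => m.index);

// ...and with the * shorthand.
const withStar = [...text.matchAll(/w(?=.*w)/g)].map((m) => m.index);

console.log(withBraces); // [0, 4]
console.log(withStar);   // [0, 4]
```

Both versions find the first two w's and skip the last one, at index 8.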

We saw parentheses, (), before when we looked at capture groups. Capture groups allow us to refer to a portion of a match using $1, $2, etc.

It's important to note that those capturing parentheses are totally different from (?= ), which is what we use here for lookaheads. A lookahead only checks a condition; it is not itself a capture group.
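One way to see the difference is to look at what match returns in each case. This is just a small sketch with a two-letter input I chose myself.

```javascript
// A capture group: the second "w" is consumed and stored as group 1.
const captured = 'ww'.match(/w(w)/);

// A lookahead: the second "w" is only checked, not consumed or stored.
const lookedAhead = 'ww'.match(/w(?=w)/);

console.log(captured[0], captured.length);       // "ww" 2  (full match + group 1)
console.log(lookedAhead[0], lookedAhead.length); // "w" 1   (full match only)
```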

So far we've been able to get all w's except the last one. What if we want to do the opposite?

What if we want only the last w? We can do that with a very small change. All we have to do is change the equals sign, =, to an exclamation or bang, !.

The way to read this is, "match w if what follows is not this pattern (all characters, zero or more, followed by w)".
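Here's a small sketch of the negative lookahead in JavaScript, again using a test string of my own.

```javascript
// Illustrative input: the w's sit at indexes 0, 4 and 8.
const sample = 'w a w a w';

// Match "w" only if no other "w" appears anywhere after it.
const lastW = sample.match(/w(?!.*w)/);

console.log(lastW.index); // 8, the position of the final "w"
```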

Now we know about two types of lookaheads:

Positive lookaheads: (?= )
Negative lookaheads: (?! )

Another quick example of negative lookaheads. Let's match "grey" only if it's not followed by " hound":

If we want to do a positive lookahead, we switch the ! to a = and we will match "grey" only if it is followed by " hound":
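Both versions side by side, as a rough sketch. The sentence is invented for the example; note the space inside the lookahead.

```javascript
// Made-up input containing "grey" twice.
const animals = 'grey hound, grey wolf';

// Negative lookahead: "grey" NOT followed by " hound".
const notHound = animals.match(/grey(?! hound)/);

// Positive lookahead: "grey" followed by " hound".
const isHound = animals.match(/grey(?= hound)/);

console.log(notHound.index); // 12, the "grey" in "grey wolf"
console.log(isHound.index);  // 0, the "grey" in "grey hound"
```

In both cases the matched text is just "grey"; the " hound" part is never included.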

What about a more generalized case of our original example?

What if we wanted the last instance of different characters?

Our expression succeeds in getting the last w, but what if we wanted to succeed in also getting the last a?

Overall we'd like to extend this to get the last value of every letter. Grab the last b, the last a, the last c, etc.

To start, let's put the w in a capture group and refer to it inside of our lookahead. To do this, we use \1, a backreference to whatever the first capture group matched.

Now let's use the pipe to say "a or w".

This gives us the last instance of a, and the last instance of w.

The way you want to read this is:

Match a or w if it's not followed by zero or more characters and then that same letter again (the \1).

Because in order for an "a" to be the last "a", it must not be followed by any number of characters with an "a" at the end.
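Putting that together, the expression might look like the sketch below. The pattern (a|w)(?!.*\1) is my reconstruction from the description above, and the input string is my own.

```javascript
// Illustrative input mixing the two letters.
const letters = 'wawaw';

// (a|w) captures either letter; \1 inside the lookahead refers back to
// whichever letter was captured, so an "a" is only compared against later "a"s.
const lastOfEach = [...letters.matchAll(/(a|w)(?!.*\1)/g)]
  .map((m) => [m[0], m.index]);

console.log(lastOfEach); // [["a", 3], ["w", 4]]
```

The "a" at index 3 matches even though a "w" follows it, because the backreference only cares about the same letter.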

So this gives us a nice working example for a and w, but what about other letters? Well, we could just add them each by hand, separated by pipes:

But this is obviously a tedious and error-prone approach. 

Instead, we can just use the dot (.) to represent "any character", and it works exactly the same way:

This regular expression matches the last instance of every character in the string.
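A small sketch of the generalized version, assuming the pattern is (.)(?!.*\1) and using a test word I picked:

```javascript
// Illustrative input with one repeated letter.
const word = 'hello';

// Capture any single character, then require that the same character
// does not appear again later in the string.
const lastOccurrences = word.match(/(.)(?!.*\1)/g);

console.log(lastOccurrences); // ["h", "e", "l", "o"]
```

The first "l" is skipped because another "l" follows it; every other character is already its own last instance.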

Let's look at another simple example:

Let's use a capture group and a backreference that refers to that capture group:

Then let's put just the letter "r" inside the capture group, giving (r)\1:

This is exactly equivalent to "rr".

We can add quantifiers to this. Imagine we want "r" followed by two more "r"s.

We can match "rarr" by adding an "a" in the middle:
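Here's how I'd write those three backreference examples in JavaScript. The exact patterns are my reconstruction from the descriptions above, so treat them as a sketch.

```javascript
// (r)\1 : capture "r", then require the same text again. Same as /rr/.
const doubleR = /(r)\1/;

// (r)\1{2} : "r" followed by two more copies of the captured "r".
const tripleR = /(r)\1{2}/;

// (r)a\1{2} : "r", a literal "a", then two copies of the captured "r".
const rarr = /(r)a\1{2}/;

console.log(doubleR.test('rr'));       // true
console.log(doubleR.test('r'));        // false
console.log('brrr'.match(tripleR)[0]); // "rrr"
console.log(rarr.test('rarr'));        // true
```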

Summary

  • Positive lookaheads: (?=)
  • Negative lookaheads: (?!)
  • Back references: ()\1