Porting Invregex.py To Javascript (node.js)

December 20, 2023 Post a Comment

I have been trying to port invRegex.py to a node.js implementation for a while, but I'm still struggling with it. I already have the regular expression parse tree thanks to the ret

Solution 1:

I advise you to write iterator classes. They are easy to implement (basically they are state machines), they have a low memory footprint, they can be combined to build up increasingly complex expressions (please scroll down to see the final result), and the resulting iterator can be wrapped in an enumerator.

Each iterator class has the following methods:

first: initializes the state machine (first match)
next: proceeds to the next state (next match)
ok: initially true, but becomes false once 'next' proceeds beyond the last match
get: returns the current match (as a string)
clone: clones the object; essential for repetition, because each instance needs its own state

Start off with the most trivial case: a sequence of one or more characters that should be matched literally (e.g. /foo/). Needless to say this has only one match, so 'ok' will become false upon the first call of 'next'.

functionLiteral(literal) { this.literal = literal; }

Literal.prototype.first = function() { this.i = 0; };
Literal.prototype.next = function() { this.i++; };
Literal.prototype.ok = function() { returnthis.i == 0; };
Literal.prototype.get = function() { returnthis.literal; };
Literal.prototype.clone = function() { returnnewLiteral(this.literal); };

Character classes ([abc]) are trivial too. The constructor accepts a string of characters; if you prefer arrays, that's easy to fix.

functionCharacterClass(chars) { this.chars = chars; }

CharacterClass.prototype.first = function() { this.i = 0; };
CharacterClass.prototype.next = function() { this.i++; };
CharacterClass.prototype.ok = function() { returnthis.i < this.chars.length; };
CharacterClass.prototype.get = function() { returnthis.chars.charAt(this.i); };
CharacterClass.prototype.clone = function() { returnnewCharacterClass(this.chars); };

Now we need iterators that combine other iterators to form more complex regular expressions. A sequence is just two or more patterns in a row (like foo[abc]).

function Sequence(iterators) {
   if (arguments.length > 0) {
      this.iterators = iterators.length ? iterators : [new Literal('')];
   }
}
Sequence.prototype.first = function() {
   for (var i inthis.iterators) this.iterators[i].first();
};
Sequence.prototype.next = function() {
   if (this.ok()) {
      var i = this.iterators.length;
      while (this.iterators[--i].next(), i > 0 && !this.iterators[i].ok()) {
         this.iterators[i].first();
      }
   }
};
Sequence.prototype.ok = function() {
   returnthis.iterators[0].ok();
};
Sequence.prototype.get = function() {
   var retval = '';
   for (var i inthis.iterators) {
      retval += this.iterators[i].get();
   }
   return retval;
};
Sequence.prototype.clone = function() {
   return new Sequence(this.iterators.map(function(it) { return it.clone(); }));
};

Another way to combine iterators is the choice (a.k.a. alternatives), e.g. foo|bar.

function Choice(iterators) { this.iterators = iterators; }

Choice.prototype.first = function() {
   this.count = 0;
   for (var i inthis.iterators) this.iterators[i].first();
};
Choice.prototype.next = function() {
   if (this.ok()) {
      this.iterators[this.count].next();
      while (this.ok() && !this.iterators[this.count].ok()) this.count++;
   }
};
Choice.prototype.ok = function() {
   returnthis.count < this.iterators.length;
};
Choice.prototype.get = function() {
   returnthis.iterators[this.count].get();
};
Choice.prototype.clone = function() {
   return new Choice(this.iterators.map(function(it) { return it.clone(); }));
};

Other regex features can be implemented by combining the existing classes. Class inheritance is a great way to do this. For example, an optional pattern (x?) is just a choice between the empty string and x.

functionOptional(iterator) {
   if (arguments.length > 0) {
      Choice.call(this, [newLiteral(''), iterator]);
   }
}
Optional.prototype = newChoice();

Repetition (x{n,m}) is a combination of Sequence and Optional. Because I have to inherit one or the other, my implementation consists of two mutually dependent classes.

functionRepeatFromZero(maxTimes, iterator) {
   if (arguments.length > 0) {
      Optional.call(this, newRepeat(1, maxTimes, iterator));
   }
}
RepeatFromZero.prototype = newOptional();

functionRepeat(minTimes, maxTimes, iterator) {
   if (arguments.length > 0) {
      var sequence = [];
      for (var i = 0; i < minTimes; i++) {
         sequence.push(iterator.clone());   // need to clone the iterator
      }
      if (minTimes < maxTimes) {
         sequence.push(newRepeatFromZero(maxTimes - minTimes, iterator));
      }
      Sequence.call(this, sequence);
   }
}
Repeat.prototype = newSequence();

As I said earlier, an iterator can be wrapped into an enumerator. This is simply a loop that you can break whenever you want.

function Enumerator(iterator) {
   this.iterator = iterator;

   this.each = function(callback) {
      for (this.iterator.first(); this.iterator.ok(); this.iterator.next()) {
         callback(this.iterator.get());
      }
   };
}

Time to put it all together. Let's take some silly regular expression:

([ab]{2}){1,2}|[cd](f|ef{0,2}e)

Composing the iterator object is really straightforward:

function GetIterationsAsHtml() {

   variterator=newChoice([
      newRepeat(1, 2,
         newRepeat(2, 2, newCharacterClass('ab'))),
      newSequence([
         newCharacterClass('cd'),
         newChoice([
            newLiteral('f'),
            newSequence([
               newLiteral('e'),
               newRepeatFromZero(2, newLiteral('f')),
               newLiteral('e')
            ])
         ])
      ])
   ]);

   variterations='<ol>\n';
   varenumerator=newEnumerator(iterator);
   enumerator.each(function(iteration) { iterations += '<li>' + iteration + '</li>\n'; });
   return iterations + '</ol>';
}

This yields 28 matches, but I will spare you the output.

My apologies if my code is not compliant to software patterns, is not browser-compatible (works OK on Chrome and Firefox) or suffers from poor OOP. I just hope it makes the concept clear.

EDIT: for completeness, and following OP's initiative, I implemented one more iterator class: the reference.

A reference (\1 \2 etc) picks up the current match of an earlier capturing group (i.e. anything in parentheses). Its implementation is very similar to Literal, in that it has exactly one match.

functionReference(iterator) { this.iterator = iterator; }

Reference.prototype.first = function() { this.i = 0; };
Reference.prototype.next  = function() { this.i++; };
Reference.prototype.ok    = function() { returnthis.i == 0; };
Reference.prototype.get   = function() { returnthis.iterator.get(); };
Reference.prototype.clone = function() { returnnewReference(this.iterator); };

The constructor is given an iterator that represents the referenced subpattern. Taking (foo|bar)([xy])\2\1 as an example (yields fooxxfoo, fooyyfoo, barxxbar, baryybar):

var groups=newArray();

var iterator =new Sequence([
   groups[1] =new Choice([new Literal('foo'), new Literal('bar')]),
   groups[2] =new CharacterClass('xy'),
   new Reference(groups[2]),
   new Reference(groups[1])
]);

Capturing groups are specified as you build up the tree of iterator classes. I am still doing that manually here, but eventually you want this to be automated. That is just a matter of mapping your parse tree to a similar tree of iterator classes.

EDIT 2: here's a relatively simple recursive function that will convert a parse tree produced by ret.js into an iterator.

functionParseTreeMapper() {
    this.capturingGroups = [];
}
ParseTreeMapper.prototype.mapToIterator = function(parseTree) {
    switch (parseTree.type) {
        case ret.types.ROOT:
        case ret.types.GROUP:
            var me = this;
            var mapToSequence = function(parseTrees) {
                returnnewSequence(parseTrees.map(function(t) {
                    return me.mapToIterator(t);
                }));
            };
            var group = parseTree.options ?
                newChoice(parseTree.options.map(mapToSequence)) : 
                mapToSequence(parseTree.stack);
            if (parseTree.remember) {
                this.capturingGroups.push(group);
            }
            return group;
        case ret.types.SET:
            returnnewCharacterClass(this.mapToCharacterClass(parseTree.set));
        case ret.types.REPETITION:
            returnnewRepeat(parseInt(parseTree.min), parseInt(parseTree.max), this.mapToIterator(parseTree.value));
        case ret.types.REFERENCE:
            var ref = parseInt(parseTree.value) - 1;
            return ref inthis.capturingGroups ?
                newReference(this.capturingGroups[ref]) :
                newLiteral('<ReferenceOutOfRange>');
        case ret.types.CHAR:
            returnnewLiteral(String.fromCharCode(parseTree.value));
        default:
            returnnewLiteral('<UnsupportedType>');
    }
};
ParseTreeMapper.prototype.mapToCharacterClass = function(parseTrees) {
    var chars = '';
    for (var i in parseTrees) {
        var tree = parseTrees[i];
        switch (tree.type) {
            case ret.types.CHAR:
                chars += String.fromCharCode(tree.value);
                break;
            case ret.types.RANGE:
                for (var code = tree.from; code <= tree.to; code++) {
                    chars += String.fromCharCode(code);
                }
                break;
        }
    }
    return chars;
};

Usage:

var regex = 'b[a-n]{3}';
var parseTree = ret(regex);    // requires ret.jsvariterator = new ParseTreeMapper().mapToIterator(parseTree);

I put all components together in this demo: http://jsfiddle.net/Pmnwk/3/

Note: many regex syntax constructs are not supported (anchors, look-ahead, look-behind, recursion), but I guess it is already pretty much up to par with invRegex.py.

Solution 2:

Here's a version that makes a function for each part of the input and composes all of them to produce a function that'll generate each regex result and feed it into that argument:

//Takes in a list of things, returns a function that takes a function and applies it to// each Cartesian product. then composes all of the functions to create an// inverse regex generator.functionCartesianProductOf() {
    var args = arguments;
    returnfunction(callback) {
        Array.prototype.map.call(args, function(vals) {
            returnfunction(c, val) {
                vals.forEach(function(v) {
                    c(val + v);
                });
            };
        }).reduce(function(prev, cur) {
            returnfunction(c, val) {
                prev(function(v){cur(c, v)}, val);
            };
        })(callback, "");
    };
}

Modified to work off a parse tree (copied a litte code from here):

//Takes in a list of things, returns a function that takes a function and applies it to// each Cartesian product.functionCartesianProductOf(tree) {
    var args = (tree.type == ret.types.ROOT)? tree.stack :
                ((tree.type == ret.types.SET)? tree.set : []);

    returnfunction(callback) {
        var funs = args.map(function(vals) {
            switch(vals.type) {
                case ret.types.CHAR:
                    returnfunction(c, val) {
                        c(val + vals.value);
                    };
                case ret.types.RANGE:
                    returnfunction(c, val) {
                        for(var i=vals.from; i<=vals.to; i++) {
                            c(val+String.fromCharCode(i));
                        }
                    };
                case ret.types.SET:
                     returnfunction(c, val) {
                         CartesianProductOf(vals)(function(i) {c(val+i)});
                     };
/*                   return function(c, val) {
                        vals.set.forEach(function(v) {
                            c(val + v);
                        });
                    };        */case ret.types.REPETITION:
                    var tmp = CartesianProductOf(vals.value);

                    if(vals.max == vals.min) {
                        returnfillArray(function(c, val) {
                            tmp(function(i){c(val+i);}); //Probably works?
                        }, vals.max);
                    } else {
                        returnfillArray(function(c, val) {
                            tmp(function(i){c(val+i);});
                        }, vals.min).concat(fillArray(function(c, val) {
                            c(val);
                            tmp(function(i){c(val+i);});
                        }, vals.max-vals.min));
                    }
                default:
                    returnfunction(c, val) {
                        c(val);
                    };
            }
        }).reduce(function(prev, cur) { //Flatten array.return prev.concat(cur);
        }, []);

        if(tree.type == rets.type.ROOT) //If it's a full tree combine all the functions.
            funs.reduce(function(prev, cur) { //Compose!returnfunction(c, val) {
                    prev(function(v){cur(c, v)}, val);
                };
            })(callback, "");
        else//If it's a set call each function.
            funs.forEach(function(f) {f(callback, "")}); 
    };
}

functionfillArray(value, len) {
    var arr = [];
    for (var i = 0; i < len; i++) {
        arr.push(value);
    }
    return arr;
}

If you're alright with a less functionalish, more C-esque solution:

function helper(callme, cur, stuff, pos) {
    if(pos == stuff.length) {
        callme(cur);
    } else 
        for(var i=0; i<stuff[pos].length; i++) {
            helper(callme, cur+stuff[pos][i], stuff, pos+1);
        }
}

function CartesianProductOf(callback) {
    helper(callback, "", Array.prototype.slice.call(arguments, 1), 0);
}

Solution 3:

How about this:

var tokens = [
    ['0', '00', '01', '1', '10', '11'],
    ['@'],
    ['a', 'b', 'c', 'd', 'e', 'f'],
];

functioncartesianProductOf(arr, callback) {
  var cur = [];
  var f = function(i) {
    if (i >= arr.length) {
      callback(cur.join(''));
      return
    }
    for (var j=0; j<arr[i].length; j++) {
      cur[i] = arr[i][j];
      f(i+1);
    }
  };
  f(0);
}

cartesianProductOf(tokens, function(str) { console.log(str); });

Solution 4:

It sounds like you're asking for Lazy Cartesian Product: you do want the Cartesian product, but you don't want to compute them all beforehand (and consume all that memory). Said another way, you want to iterate through the Cartesian product.

If that's right, have you checked out this Javascript implementation of the X(n) formula? With that, you can either iterate over them in natural order <<0,0,0>, <0,0,1>, <0,1,0>, ...> or choose an arbitrary position to calculate.

It seems like you can just do:

// move through them in natural order, without precomputationlazyProduct(tokens, function (token) { console.log(token); });

// or...// pick ones out at randomvarLP = newLazyProduct(tokens);
console.log(LP.item(7));   // the 7th from the list, without precomputeconsole.log(LP.item(242)); // the 242nd from the list, again without precompute

Surely I must be missing something...? Generators are simply overkill given the X(n) formula.

UpdateInto a JSFiddle I have placed an instrumented version of the lazyProduct code, a sample tokens array of arrays, and a call to lazyProduct with those tokens.

When you run the code without modification, you'll see it generates the 0@a, etc. output expected from the sample tokens array. I think the link explains the logic pretty well, but in summary... if you uncomment the instrumentation in lazyProduct, you'll notice that there are two key variables lens and p. lens is a pre-compute of the length of each array (in the array of arrays) passed in. p is a stack that holds the current path up to where you are (eg, if you're "1st array 3rd index, 2nd array 2nd index, and 3rd array 1st index" p represents that), and this is what's passed into your callback function.

My callback function just does a join on the arguments (per your OP example), but again these are just the corresponding values in p mapped to the original array of arrays.

If you profile this further, you'll see the footprint needed to build the Cartesian product is limited to just what you need to call your callback function. Try it on one of your worst case tokens to see.

Update 2 I coded about 75% of an approach based on Cartesian product. My API took a ret.js parse tree, converted that to RPN, then generated sets of sets to pass into a X(n) calculator. Using @ruud example of ([ab]{2}){1,2}|[cd](f|ef{0,2}e), this would be generated:

newOption([
    newSet([
        newSet(newSet(['a','b']), newSet['a','b'])),
        newSet(newSet(['a','b']), newSet['a','b']))
    ]),
    newSet(
        ['c', 'd'],
        newOption([
            newSet(['f']),
            newSet(['e']]),
            newSet(['e','f']),
            newSet(newSet(['e','f']), newSet(['e', 'f']))
        ])
])

The tricky parts were nested options (buggy) and inverse character classes & back-references (unsupported).

This approach was getting brittle, and really the Iterator solution is superior. Converting from your parse tree into that should be pretty straightforward. Thanks for the interesting problem!

Solution 5:

There already are plenty of good answers here, but I specifically wanted the generator part to work, which did not for you. It seems that you were trying to do this :

//the alphanumeric part
for (x of alpha()) for (y of numeric()) console.log(x + y);

//or as generator itself like you wanted
function* alphanumeric() {
    for (x of alpha()) for (y of numeric()) yield(x + y);
}
//iterating over it
for (var i of alphanumeric()) {
    console.log(i);
}

Output:

a0
a1
b0
b1
c0
c1

You can use this for cartesian product required in regex matching.

JavaScript General