Optimizing Java Regular Expressions

I have been working with regular expressions a lot lately and thought I should share some ideas on how to make regular expressions faster in Java. Most of this ideas are quite simple yet may give you very good improvements in practice.

Low hanging fruits

Compile Patterns you use frequently.

Use small capture groups to prevent long backtracking.

Whenever possible, use non-capturing groups. (?:P) instead of (P).

Reluctant, greedy, and possessive quantifiers

Usually, reluctant quantifiers will run faster than greedy quantifiers. I am talking about +? and *? being faster than + and *. When the engine finds a greedy quantifier, it will go for the biggest possible match and start backtracking. In big strings this is usually extremely slow if you are trying to match a small substring (which is often the case), so going for reluctant quantifiers - that “grow” the match - instead of greedy ones - that “shrink” the substring they are trying to match - is very likely to improve performance.

Possessive quantifiers improve regular expression performance a lot. Therefore, you should use them whenver you can. P?+, P*+, P++ will match the pattern as any greedy quantifier would, but will never backtrack once the pattern has been matched, even if refusing to do so means that whole expression will not match.

NetHogs

This is a very short post to show one of my favorite system monitoring tools: NetHogs. If you are on a Linux box, it should be fairly easy to get it no matter what is your Linux distribution.

Once you have it, just run it as root and you will have a minimalistic, readable, and very useful list of which processes are consuming your bandwidth and how much of it each process is using.

You can also run it with the -v1 flag to get the sum of KB instead of the KB/s rate of each process.

License Notes Should Have a Date

When working on dungeon license notes I came to notice that we are in 2016 and this means that all files that I change right now require their license notes to be updated.

I wondered if I actually needed that, so I went to Programmers Stack Exchange and asked about it. I got quite helpful answers and comments within a few hours. Essentially, it boils down to

  1. Copyright expires, therefore you must provide a license date.
  2. The format Copyright (C) [years] [name] should not be changed.

Therefore, I must set up Git Hooks that will update all license notes of the files with a list of years in which the file has been modified.

The solution will likely be open-sourced so it benefits a bigger group of people.

Justifying Text

In this post I briefly describe how good text justification is done using dynamic programming. In the future, I may post code for this problem.

Start by writing a ‘badness’ function that produces a value corresponding to how bad it is to have a certain set of words forming a line.

LaTeX uses the following formula for badness

badness(i, j) = (page_width - total_width)³    if [i, j] fits
                ∞                              otherwise

where
  i is the index of the first word of the line
  j is the index of the last word of the line
  page_width is the width of the document page
  total_width is the width occupied by the letters of the words [i, j]

Then, by using dynamic programming techniques, you need to find the line combination that gives the minimum badness sum.

Recovering Deleted Files

This post is specific to ext3 and ext4 partitions.

Yesterday I accidentally invoked bunzip2 without the --keep flag. This made me lose the original, compressed file. It would be complicated to get it back, therefore I used to extundelete to recover it. This post gives an example of the usage of extundelete.

First of all, I needed to have the partition mount as read-only, but I was using it to run my operating system, so I had to reboot and start from a live image.

Once I had it running, getting extundelete was easy

sudo dnf install extundelete

After that, I simply invoked

sudo extundelete --recover-file <path-to-file> <device>

In my case,

sudo extundelete --recover-file mg/wiki/dump.bz2 /dev/sda7

Finally, I copied the recovered file (found on a RECOVERED_FILES directory) to its final destination with cp and was done with it.

Hopefully this will be helpful to you someday.