Sunday, September 8, 2013

Regex, Parsing and some benchmarks in Java

Hi guys!

On the last post, we made a regex to solve a problem at CodeRanch. However, there was a complaint about reading performance, so we made some analysis about the regular expressions we were using.

We had 50 text files with 10MB each similar to this:


 MIME-Version: 1.0  
 Content-Type: multipart/mixed; boundary="MIME_boundary-1"; type="text/xml";  
 Content-Description: Some_description  
 --MIME_boundary-1  
 Content-Type: text/xml; charset="UTF-8"  
 Content-Transfer-Encoding: 8bit  
 Content-Location: dataHeader.xml  
 <?xml version="1.0" encoding="UTF-8"?>  
 <Header>  
  <DataLen>123456</DataLen>  
 </Header>  
 --MIME_boundary-2  
 Content-Type: application/octet-stream  
 Content-Location: data.bin  
 1828323232  
 --MIME_boundary-2  
(Pastebin here)

We wanted to take the content of the tag <DataLen>. Notice this is not a regular XML.

So the first idea could be find only the tag <DataLen> and take its content:

 Pattern.compile("(?:<DataLen>(?<first>.*)</DataLen>|Content-Location: data\\.bin\\n(?<second>.*)\\n)");  

Well, running 10 times this code, it took around 210ms ~ 230ms each file on Mac OS X 10.8.4 with Oracle JDK 1.7._0.21 64 bits.

Looks fair. But refactoring the regular expression to grab a bigger pattern, we obtained different results:

 Pattern.compile("--MIME_boundary-\\d*\\n(?:Content-Type: text/xml; charset=\"UTF-8\"\\nContent-Transfer-Encoding: 8bit\\nContent-Location: dataHeader\\.xml\\n<\\?xml version=\"1\\.0\" encoding=\"UTF-8\"\\?>\\n<Header>\\n <DataLen>(?<first>.*)</DataLen>|Content-Type: application/octet-stream\\nContent-Location: data\\.bin\\n(?<second>.*)\\n)");  

Notice this time we took the whole text until we reach the tag <DataLen>, and not only the tag itself.
This time, on the same machine, running 10 times, it took around 60 ms each file!

Why does it happen? The explanation is simple - it's the String Matching Algorithm the we mentioned on our last post! Basically, the bigger is the given pattern, bigger ill be the jump on the file, and faster will be the search.

The tradeoffs os this approach - memory consumption increases, the regex gets more complex and the file structure should be very well formed.

Do you want to try your own benchmarks? Take a look at our repo, specially at this and this class.

This is it! Hope you have enjoyed the benchmarks!



2 comments: