Sunday, September 8, 2013

Simple project analysing the throughput of Java IO/NIO libraries and optimization


A few days ago I went to the Code Ranch forum looking for an interesting topic to help with, and I quickly found this case from Tien Shan. Basically, he has a number of multi-format files that he needs to extract some data from. For example, a file has some text, then an XML document, and then more text. Tien described getting pretty poor performance when reading and processing the files. This was the first point that got my attention, so I did some research to see what I could discover.

I found that some outstanding guys had already made some cool benchmarks of the many possible configurations we have in Java. The coolest references I found were these [1]  [2]  [3]  [4]. Based on those findings, I created a simple Maven-based project where I could test and see the differences between those approaches, and I also added concurrent threads to the plan, since that could improve performance. The project, SimpleReadFiles, makes it easy to create some test data (see the pom.xml) and read it, and it outputs the elapsed time, so you know how long it took to read each file and how long the whole process took (thanks to concurrent programming, the total differs from the sum of the times for each file). I am looking forward to adding better formatted output and producing some statistics and graphs, all open source, so people can easily compare and try it out in their own environments. I am also working with tags there (it has just one so far, this one) to save each approach. I would also appreciate it if someone sent a contribution of any kind, there or here =)
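To give an idea of the kind of comparison I mean, here is a minimal sketch, not the actual SimpleReadFiles code. The class name ReadTimingSketch, the method names, the 8 KB buffer size and the use of a fixed thread pool are all my assumptions for illustration, and it assumes Java 8 or newer because of the lambda.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch class; not part of the SimpleReadFiles project.
class ReadTimingSketch {

    // Classic buffered, character-oriented read. Returns the number of chars read.
    static long readWithBufferedReader(Path file) throws IOException {
        long total = 0;
        char[] buf = new char[8192]; // buffer size is an arbitrary assumption
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    // NIO read through a FileChannel and a ByteBuffer. Returns the number of bytes read.
    static long readWithFileChannel(Path file) throws IOException {
        long total = 0;
        ByteBuffer buf = ByteBuffer.allocate(8192);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            int n;
            while ((n = channel.read(buf)) != -1) {
                total += n;
                buf.clear(); // discard the data; we only measure the read itself
            }
        }
        return total;
    }

    // Reads every file on a fixed thread pool and returns the elapsed
    // nanoseconds per file, in the same order as the input list.
    static List<Long> timeConcurrentReads(List<Path> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (Path file : files) {
                futures.add(pool.submit(() -> {
                    long start = System.nanoTime();
                    readWithFileChannel(file);
                    return System.nanoTime() - start;
                }));
            }
            List<Long> elapsed = new ArrayList<>();
            for (Future<Long> future : futures) {
                elapsed.add(future.get());
            }
            return elapsed;
        } finally {
            pool.shutdown();
        }
    }
}
```

Wrapping a call to timeConcurrentReads in its own pair of System.nanoTime() calls gives the whole process time, which you can then compare against the sum of the per-file values.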

Well, returning to Tien's case, XPath would not help us because of the diverse formats in a single file (at least the frameworks I know have many problems dealing with documents that are not well-formed XML). I started to think about implementing a string matching algorithm (like those cited on Wikipedia, since I used to program those and other things at the ACM-ICPC event). To keep things clearer and rely on the standard libraries Java has, I decided to use regex to solve the case. I started playing with the data on a very cool JavaScript-based site named regexpal, and then I used the Javadoc to clear up some usage questions. I put the result in a GitHub gist -> .
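Just to illustrate the regex idea described above, here is a small sketch of pulling XML blocks out of mixed text with java.util.regex. It is not the code from the gist: the root element name "record", the class name MixedFileRegexSketch and the sample data are hypothetical, and the pattern assumes that record elements are never nested inside each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch class; the element name "record" is an assumption.
class MixedFileRegexSketch {

    // DOTALL lets '.' match line breaks, and the reluctant ".*?" stops at the
    // first closing tag instead of swallowing everything up to the last one.
    private static final Pattern XML_BLOCK =
            Pattern.compile("<record\\b[^>]*>.*?</record>", Pattern.DOTALL);

    // Returns every <record>...</record> block found in the content, in order.
    static List<String> extractXmlBlocks(CharSequence content) {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = XML_BLOCK.matcher(content);
        while (matcher.find()) {
            blocks.add(matcher.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String sample = "some header text\n"
                + "<record id=\"1\">\n  <name>a</name>\n</record>\n"
                + "more plain text\n"
                + "<record id=\"2\"><name>b</name></record>\n"
                + "footer text";
        for (String block : extractXmlBlocks(sample)) {
            System.out.println(block);
        }
    }
}
```

Each extracted block can then be handed to a regular XML parser, since on its own it should be a well-formed fragment even though the surrounding file is not.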

Now I will put it all together and send it to Tien. I hope Regex+good_IO_configuration solves his problem.

Hope you enjoyed the read. If you have any question or comment, feel free to write =D
