A few days ago I went to the Code Ranch forum to look for an interesting topic to help with, and I quickly found this case from Tien Shan. Basically, he has a number of multi-format files that he needs to extract some data from. For example, a file has some text, then an XML document, and then more text. Tien described having pretty poor performance when reading and processing the files. This was the first point that got my attention, so I did some research and made some discoveries.
I found that some outstanding people had already made some cool benchmarks of the many possible I/O configurations that we have in Java. The coolest references I found were these. Based on those discoveries, I created a simple Maven-based project where I could test and see the differences between those approaches, and I also added some concurrent threads to the plan, since that could improve performance. The project, SimpleReadFiles, makes it easy to create some test data (see the pom.xml) and read it, and it outputs the elapsed time, so you can see how long each file took to read and how long the whole process took (thanks to concurrent programming, the whole process time is different from the sum of the times of each file). I am looking forward to adding better formatted output and some open source statistics and graphs, so people can easily compare the approaches and try them out in their own environments. I am also using tags in the repository (there is just one so far, this one) to keep each approach saved. I would also appreciate it if someone sent a contribution of any kind, there or here =)
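To make the idea concrete, here is a minimal sketch of timing buffered reads on a thread pool. It is not the actual SimpleReadFiles code: the class name `ConcurrentReadSketch`, the methods `timeRead` and `timeReads`, the 8 KB buffer and the pool of 4 threads are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not the SimpleReadFiles implementation.
public class ConcurrentReadSketch {

    /** Reads one file through a BufferedReader and returns the elapsed time in nanoseconds. */
    static long timeRead(Path file) throws IOException {
        long start = System.nanoTime();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192]; // buffer size is an arbitrary choice
            while (reader.read(buffer) != -1) {
                // Content is discarded; only the read time is measured here.
            }
        }
        return System.nanoTime() - start;
    }

    /** Reads all files on a fixed thread pool and returns the per-file times in input order. */
    static List<Long> timeReads(List<Path> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (Path file : files) {
                tasks.add(() -> timeRead(file));
            }
            List<Long> times = new ArrayList<>();
            // invokeAll blocks until every task has finished.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                times.add(result.get());
            }
            return times;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Path> files = new ArrayList<>();
        for (String arg : args) {
            files.add(Paths.get(arg));
        }
        long wallStart = System.nanoTime();
        List<Long> times = timeReads(files, 4); // 4 threads is an arbitrary choice
        long wall = System.nanoTime() - wallStart;

        long sum = 0;
        for (int i = 0; i < files.size(); i++) {
            System.out.printf("%s: %.3f ms%n", files.get(i), times.get(i) / 1e6);
            sum += times.get(i);
        }
        System.out.printf("sum of per-file times: %.3f ms, whole process: %.3f ms%n",
                sum / 1e6, wall / 1e6);
    }
}
```

Pass the file paths as command-line arguments. With several files and more than one thread, the whole process time will usually come out lower than the sum of the per-file times, which is the difference mentioned above.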
Now I will put it all together and send it to Tien. I hope the combination of regex and a good I/O configuration solves his problem.
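As a rough idea of what the regex part could look like, here is a minimal sketch that pulls the first XML fragment out of mixed text. It assumes the name of the root element is known in advance and that the element is not nested inside another element of the same name; the class `XmlExtractSketch`, the method `extractXml` and the `record` element are hypothetical, since the post does not show Tien's real file format.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extracts an XML fragment that sits between plain text.
public class XmlExtractSketch {

    /**
     * Returns the first <rootTag ...>...</rootTag> fragment found in the content.
     * Assumes rootTag is known and not nested in an element with the same name.
     */
    static Optional<String> extractXml(String content, String rootTag) {
        String tag = Pattern.quote(rootTag);
        // DOTALL lets '.' match line breaks, so the fragment may span several lines.
        Pattern pattern = Pattern.compile(
                "<" + tag + "(?:\\s[^>]*)?>.*?</" + tag + "\\s*>", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(content);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String content = "some text\n<record id=\"1\">\n  <name>x</name>\n</record>\nmore text";
        // Prints the fragment if one is found, otherwise a short message.
        System.out.println(extractXml(content, "record").orElse("no XML found"));
    }
}
```

A regex like this is only reasonable for picking the fragment out of the surrounding text. Once extracted, the fragment is better handed to a real XML parser than processed further with regex.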
I hope you enjoyed the read. If you have any questions or comments, feel free to write =D