Thursday, March 1, 2007

Java String performance

We all know that misuse of Strings degrade performance. But more often than not, we go back to look at our String code only when we are asked to optimize performance. This is a short blog on the performance aspect of Strings which would be good to keep in mind when coding.

So, the first thing to remember is that Strings are immutable- once you create them, you can't change them. Oh yes, you might think you're modifying them, but in reality, a new String is generated. Warning bells should be ringing in your head by now :-)

If you are concatenating Strings over and over, lets say in a loop, then you know that because they are immutable, new Strings keep getting generated. The javac compiler internally uses a StringBuffer to do this- so for example, you have

 String itemList = "";
 itemList=itemList + items[i].description;
in a loop.

What happens is, within the loop, two objects are generated. One is a StringBuffer-
itemList=new StringBuffer().append(itemList).
append(items[i].description).toString();

The other is the String that gets assigned to itemList via the toString().

A much better way out would be to create a StringBuffer outside the loop, and just append to it within the loop. More efficient.

Also beware of certain String operations like substring, trim or replace that actually work by creating new Strings. Consider using equivalent StringBuffer methods instead.

Another thing we do not think about is comparing Strings.
Comparing a String via equals() can be fairly expensive. It basically involves a character-by-character comparison, though every effort is made to avoid it. For example, if the lengths do not match, the character by character comparison is not done. But there are many cases where lengths are same and strings are long... in these cases, equals() can be time consuming.
Now we know that comparing Strings with == will not solve our problem because == compares the pointer references, and since both Strings are different instances, you will get a false, even if characters match.

Fortunately, there is a way out, if you know about the intern() method. According to the JavaDoc, intern() returns a canonical representation for the string object. A pool of strings, initially empty, is maintained privately by the class String. When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.

What this means is that if you have two identical Strings and intern them, only one instance is maintained. So then you can safely use == and have it compare the references and hence return a true.

Now the compiler automatically interns String constants for you. But you could explicitly intern() other Strings that will be compared extensively and repetitively, for example.

As with everything else in life, make sure that if you are interning, do it responsibly. If you intern every String in sight, you are in deep trouble. A pool of Strings will have maintenance costs and you cannot rely on interned Strings getting garbage collected. Hence, if you use it rashly, your performance will actually degrade and you will possibly run into memory issues.
It is worth knowing about intern though- after tuning other areas of your code, you could consider intern especially if you know for a fact that a particular operation will be more efficient by interning the String in question.

15 comments:

jervin said...

There is another potentially big gotcha with Strings. Since in Java, the java.lang.String class is really a wrapper around a character array (char[]), you must be wary of grabbing a substring. Java makes the assumption that you want to save space, so most times it will just hand back a new wrapper that references the original character array. This is usually a fine assumption except when all you want is a small piece of a large string. In that scenario you can quickly chew up alot of space, the original ( potentially huge ) character array is never garbage collected.

In reality, I have not really run into this issue a great deal, but it is something to keep in mind in those certain situations. I believe that invoking the constructor on String will recreate the array and thus remove this problem.

BenD said...

You might also want to look into StringBuilder vs StringBuffer in Java 1.5+

StringBuilder is the non-treadsafe version of StringBuilder.

http://javahowto.blogspot.com/2006/08/stringbuilder-vs-stringbuffer.html

However, with that said, there is little benefit to using non-synchronized classes is single threaded apps. The JVM is pretty damn smart about these things now a days.

thought-bytes said...

Yes, StringBuilder is definitely worth using- if you're not worried about synchronization, then go with a StringBuilder. And as I have indicated earlier, think first and use in moderation. This echoes a similar sentiment...

Anonymous said...

interned strings go in the permgen space, too much and you'll get OOM exceptions...

Anonymous said...

Intern should be considered harmful. Once a Java developer understands how use the mutable form of String objects (Builder) then there is no reason to play with String's behavior.

Natxotron said...

It is interesting to prefix the size of the Stringbuffer, it improves the CPU time because the JVM has not to resize the buffer.

lazythinker said...

This is one of those optimizations that can easily make your code less readable with no real benefit.

The JVM is remarkably efficient at allocating new objects -- faster than malloc in C, for instance, because it's much better about recycling memory intelligently.

Unfortunately, many developers have "warning bells" go off when they realize how many objects they are instantiating, and they quickly start making their code less readable by adding "optimizations" everywhere that might in the end save a handful of milliseconds in operations that are 99% IO-bound.

Some even code their own object pools to avoid the dreaded "new" -- often introducing subtle, hard-to-find bugs and again, not gaining anything in performance.

It's almost never necessary. Focus on putting indexes on your database tables first.

Do your String processing with whatever is most readable and easily maintainable. If you seriously think a few String concatenations could be hurting you (maybe you're doing a *lot* in a tight loop?), profile it, make your optimizations, and profile it again to see what you gained.

Some of these things *did* make a difference with the old JVMs. But nowadays, for any optimizations, you should provide actual stats of time gained in a "similar to actual usage" scenario before posting about it. Otherwise, you're risking convincing developers to write less maintainable code for no benefit.

Nirmit Shah said...

Avoid new opreator while creating a String object. it will actually creates two objects - one in StringPool & one in memory(non-pool).
i.e, String s = "String1"; //This will create only one Object in the StringPool
String s2 = new String("String2"); //seems same as prevois one, but this will create two objects.one in StringPool & other in normal(non-pool)memory.

spudtater said...

> Now we know that comparing Strings with == will not solve our problem because == compares the pointer references, and since both Strings are different instances, you will get a false, even if characters match.

...or that's what you'd expect, anyway. I won't pretend to understand what's going on under the scenes, but in fact you can use == for Strings.

Anonymous said...

U dont need to explicitly intern. All constant strings that appear in a class are automatically interned.

Geniot said...

I've written a program which I've jprofiled to a string comparison bottleneck. I'm thinking of converting each string into integers or integer sequences and comparing them instead (my strings are pretty constant - dictionary headwords). The strings are in utf8.
Do you think this should work better if I do this kind of conversion before comparison? Is there a known algorithm for comparing strings by first converting them into integers?

Anonymous said...

Geniot, the Strings are already internally represented as integer sequences (that is what a char[] is). If you look at the existing String.equals() method, it loops through the two char[]s to find out if the Strings have the same "integer sequence". There is no way to further optimise that.

Ahmed said...

Interned Strings DO get garbage collected.

http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html

Anonymous said...

you have a nice site.thanks for sharing this site. various kinds of ebooks are available here

http://feboook.blogspot.com

vivec said...

i am working on a program at the moment which is supposed to read multiple lines of data out of a file. the data is collected and, when a certain amount of data is gathered, concatenated to a string/sql-statement and written to the database. just by avoiding the string being wrapped and unwrapped all the time i cut down the runtime by almonst 50%.