Bug ID: JDK-4045622 java.lang.String.hashCode spec incorrectly describes the hash algorithm

SUGGESTED FIX The following is the accepted proposal for the new string hash function, followed by a report justifying its selection: New String Hash Function (Bugs 4045622, 1258091) ================================================ The Problem: The currently specified String hash function does not match the currently implemented function. The specified function is not implementable. (It addresses characters outside of the input string.) The implemented function performs very poorly on certain classes of strings, including URLs. (The poor performance is due to the "sampling" nature of the function for strings over 15 characters.) I view the specification problem as the perfect opportunity to replace the unfortunate implementation. Requesters: The problems with the implementation have been mentioned on comp.sys.java.lang.programmer, though the extent of the problem may not be known outside of JavaSoft. The problems with the spec were discovered by Peter Kessler and myself. Proposed API change: No API would change, per se. The function computed by String.hashCode() would change to: s[0] * 31^(n-1) + s[1] * 31^(n-2) + ... + s[n-1] where s[i] is the ith character of string s. The Java Language Specification (which specifies the value to be returned by String.hashCode()) would be modified to reflect this. The new hash function was selected after a fair amount of study, as described in Exhibit A. In the unlikely event that you want even more detail, see me. Implementation: Trivial. (4 lines of code, which have already been written.) Performance impact: Hashing large strings will be somewhat more expensive, as the new hash function examines every character, but Hashtables performance will, on balance, improve, as hash collisions will be vastly reduced. While one could construct a program to demonstrate the reduced speed of the new hash function, I would be very surprised if any real applications were adversely affected in any measurable way. Risk Assessment: Surprisingly small. Serialization of Hashtables does not depend on consistent hash values. It is possible that some customer has implemented some persistent data structure that relies on the String hash function not changing, but it is highly unlikely that more than a handful of customers have done so. Testing Impact: None, really. I've performed a fair number of tests on the proposed hash function already. (See below.) Doc Impact: The release notes would have to mention the change, so that any "researchy" customers who do rely on the precise behavior of the hash function would be alerted. The JLS would have to be modified in its next edition. The actual text would be very short (1-2 paragraphs). ****************************************************************************** Exhibit A As per Guy's request, here's a note describing the "Pedigree" of my proposed String hash function. Surprisingly, I could find very little in the way of published literature on practical, general-purpose string hash functions; much of what is published is either outdated (e.g., Knuth V. 3, Pp. 508-513) or arcane (lots of journal articles on 'perfect hashing'). Most of the standard algorithms and data structures textbooks barely mention the subject. The best treatment I was able to find was in Aho, Sethi and Ullman's "Compilers: Principles, Techniques and Tools" AKA "The Dragon Book" (Pp. 433-438). The class of hash function that I recommend using is polynomials whose coefficients are the characters in the string: P(x) = s[0] * x^(n-1) + s[1] * x^(n-2) + ... + s[n-1] This polynomial is calculated by the following code fragment: int hashVal = 0; for (int i=0; i<string.length; i++) hashVal = x*hashVal + s.charAt(i); The value of x is part of the definition of the hash function; choosing this value is somewhat of a black art. Clearly 1 or any multiple of 2 is bad. Primes seem safer than composite numbers, though some composite numbers yield excellent performance (as shown below). While this class of hash function is recommended in The Dragon Book (P(65599)) and Kernighan and Ritchie's "The C Programming Language, 2 Ed." (P(31)), it is not attributed in either of these books. I went so far as to call up Dennis Ritchie, who said that he did not know where the hash function came from. He walked across the hall and asked Brian Kernighan, who also had no recollection. A shareware library from Germany called "Matpack" contains ten string hash functions, including P(33), which it attributes to noted Unix hacker Chris Torek. I sincerely doubt that it's original with him. P(37) is used in the definition of the current Java String hash function for short (< 16 characters) strings. I'm afraid that the origins of this little function are lost to history. So why do I think we should use this function? Simply put, it's the best general purpose string hash function that I was able to find, and it's cheap to calculate. By 'general purpose', I mean that it's not optimized for any specific type of strings (e.g., compiler symbols), but seems to resist collisions across every collection of strings that I was able to throw at it. This is critical given that we have no idea what sort of strings people will store in Java hash tables. Also, the performance of this class of hash functions seems largely unaffected by whether the size of the hash table is prime or not. This is important because Java's auto-growing hash table is typically does not contain a prime number of buckets. In addition to the class of hash functions described above, The Dragon Book highly recommends a slightly more complex hash function due to P. J. Weinberger, originally used internally in his C compiler. Matpak's author, Dr. Berndt M. Gammel (of Technische Universit??t M??nchen), however, claims that Weinberger's function does not work well for large collections of arbitrary strings. It turns out that Matpak's implementation of Weinberger's function contains a typo, and the resulting function is far, far worse even than the currently implemented Java string hash function (see table below). I notified Dr. Gammel, and he promised to fix the typo. The published version of Weinberger's function (in the Dragon book) differs slightly from the version that he (Peter Weinberger) personally recommends. (They differ in the number of bits to which the hash value is shifted: the published value is 24, the recommended value is 28.) The recommended version is significantly better (see Table). Gammel recommends only three of the ten hash functions in Matpak for "general use." One is P(33), one is a minor variant on P(129), ascribed to Phong Vo, and the third is a table lookup based, S-box like algorithm, relying on a table of 256 "random" 32-bit numbers, from the WAIS project. (The Vo variant merely adds the constant 987654321 to each character to form the polynomial coefficients.) The table below summarizes the performance of the various hash functions described above, for three data sets: 1) All of the words and phrases with entries in Merriam-Webster's 2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars). 2) All of the strings in /bin/*, /usr/bin/*, /usr/lib/*, /usr/ucb/* and /usr/openwin/bin/* (66,304 strings, avg length 21 characters). 3) A list of URLs gathered by a web-crawler that ran for several hours last night (28,372 strings, avg length 49 characters). The performance metric shown in the table is the "average chain size" over all elements in the hash table (i.e., the expected value of the number of key compares to look up an element). Webster's Code Strings URLs --------- ------------ ---- Current Java Fn. 1.2509 1.2738 13.2560 P(37) [Java] 1.2508 1.2481 1.2454 P(65599) [Aho et al] 1.2490 1.2510 1.2450 P(31) [K+R] 1.2500 1.2488 1.2425 P(33) [Torek] 1.2500 1.2500 1.2453 Vo's Fn 1.2487 1.2471 1.2462 WAIS Fn 1.2497 1.2519 1.2452 Weinberger's Fn(MatPak) 6.5169 7.2142 30.6864 Weinberger's Fn(24) 1.3222 1.2791 1.9732 Weinberger's Fn(28) 1.2530 1.2506 1.2439 Looking at this table, it's clear that all of the functions except for the current Java function and the two broken versions of Weinberger's function offer excellent, nearly indistinguishable performance. I strongly conjecture that this performance is essentially the "theoretical ideal", which is what you'd get if you used a true random number generator in place of a hash function. I'd rule out the WAIS function as its specification contains pages of random numbers, and its performance is no better than any of the far simpler functions. Any of the remaining six functions seem like excellent choices, but we have to pick one. I suppose I'd rule out Vo's variant and Weinberger's function because of their added complexity, albeit minor. Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous. Josh ****************************************************************************** From: Guy Steele - Sun Microsystems Labs <###@###.###> To: jbloch@Eng Cc: gls@East, James.Gosling@Eng, wnj@central, ###@###.###, riggs@East, ###@###.### Subject: Re: String Hash Function Redux (long) Date: Mon, 28 Apr 1997 09:47:22 -0400 (EDT) Good work! Besides the fact that you have clearly established that we ought to revamp Java's hash function and have found a good solution, I think you have an MPU here (Minimum Publishable Unit)---that is, I would encourage you to write this up even more carefully and send it to an ACM conference (PLDI?) or journal. --Guy

Tumors on June 12nd, 2017 at 12:11 UTC »

This type of string hash function is sometimes referred to as djb2, named after Daniel Bernstein although he only claims to adding an initial value of 5381 to an existing algorithm. In the same usenet discussion, Chris Torek (mentioned in the op) also talks about the hash function a great deal.

edit: here Torek claims he stole the hash function from Gosling Emacs which was created by James Gosling, the creator of Java

namekuseijin on June 12nd, 2017 at 10:32 UTC »

Fascinating bit of quite recent (1997) historic trivia going back even farther (70's). That's good digital archeology, a hacking art itself.

If neither Steele, Ritchie, Kernighan and many of those serious computer scientists from the 70's know its origins, we can be pretty sure it's alien tech, probably first implemented in Lisp. ;-)

mariopenna on June 12nd, 2017 at 10:24 UTC »

That's a dumb question. Of course it was Satoshi Nakamoto!