The memcached client interface supports a number of different distribution algorithms that are used in multi-server configurations to determine which host should be used when setting or getting data from a given memcached instance. When you get or set a value, a hash is constructed from the supplied key and then used to select a host from the list of configured servers. Because the hashing mechanism uses the supplied key as the basis for the hash, the selected server will be the same during both set and get operations.
For example, if you have three servers, A, B, and C, and you set
the value myid
, then the
memcached client will create a hash based on
the ID and select server B. When the same key is requested, the
same hash is generated, and the same server, B, will be selected
to request the value.
Because the hashing mechanism is part of the client interface, not the server interface, the hashing process and selection is very fast. By performing the hashing on the client, it also means that if you want to access the same data by the same ID from the same list of servers but from different client interfaces, you must use the same or compatible hashing mechanisms. If you do not use the same hashing mechanism then the same data may be recorded on different servers by different interfaces, both wasting space on your memcached and leading to potential differences in the information.
One way to use a multi-interface compatible hashing mechanism
is to use the libmemcached
library and the
associated interfaces. Because the interfaces for the
different languages (including C, Ruby, Perl and Python) are
using the same client library interface, they will always
generate the same hash code from the ID.
One issue with the client-side hashing mechanism is that when
using multiple servers and extending or shrinking the list of
servers that you have configured for use with
memcached, the resulting hash may change. For
example, if you have servers A, B, and C; the computed hash for
key myid
may equate to server B. If you add
another server, D, into this list, then computing the hash for
the same ID again may result in the selection of server D for
that key.
This means that servers B and D both contain the information for
key myid
, but there may be differences
between the data held by the two instances. A more significant
problem is that you will get a much higher number of
cache-misses when retrieving data as the addition of a new
server will change the distribution of keys, and this will in
turn require rebuilding the cached data on the
memcached instances and require an increase
in database reads.
For this reason, there are two common types of hashing algorithm, consistent and modula.
With consistent hashing algorithms, the
same key when applied to a list of servers will always use the
same server to store or retrieve the keys, even if the list of
configured servers changes. This means that you can add and
remove servers from the configure list and always use the same
server for a given key. There are two types of consistent
hashing algorithms available, Ketama and Wheel. Both types are
supported by libmemcached
, and
implementations are available for PHP and Java.
There are some limitations with any consistent hashing algorithm. When adding servers to an existing list of configured servers, then keys will be distributed to the new servers as part of the normal distribution. When removing servers from the list, the keys will be re-allocated to another server within the list, which will mean that the cache will need to be re-populated with the information. Also, a consistent hashing algorithm does not resolve the issue where you want consistent selection of a server across multiple clients, but where each client contains a different list of servers. The consistency is enforced only within a single client.
With a modula hashing algorithm, the client will select a server by first computing the hash and then choosing a server from the list of configured servers. As the list of servers changes, so the server selected when using a modula hashing algorithm will also change. The result is the behavior described above; changes to the list of servers will mean different servers are selected when retrieving data leading to cache misses and increase in database load as the cache is re-seeded with information.
If you use only a single memcached instance for each client, or your list of memcached servers configured for a client never changes, then the selection of a hashing algorithm is irrelevant, as you will not notice the effect.
If you change your servers regularly, or you use a common set of servers that are shared among a large number of clients, then using a consistent hashing algorithm should help to ensure that your cache data is not duplicated and the data is evenly distributed.
User Comments
Add your own comment.