
Code Massage

2010-02-28 08:32, written by Robert Klemme

This article started out as a mental experiment and led to a surprising result. I post this mostly for the fun of it, but of course you can take something away from it, and I do not mean only technical solutions. I firmly believe that a certain level of playfulness actually helps in finding better solutions. The other ingredient you need is a certain eagerness for improvement, which means not being content too early. OK, let’s start.

The scenario I started out with was this: suppose you want to anonymize email addresses because you want to publish an email but not expose addresses to spam harvesting. Yet, you want to make sure that every address is always represented by the same replacement address in order not to change the meaning. You might immediately answer “We’ll need a Hash so we can efficiently find addresses that have been replaced already”, and so did I.

Java Style

If you are familiar with the Java standard library you know that java.util.Map has methods to check whether a key is present, to set values and to retrieve values. So after switching to Ruby you might be tempted to do it like this:

subst = {}

puts email.gsub(ADDR) {|match|
  if subst.has_key? match
    subst[match]
  else
    subst[match] = "<<MAIL #{subst.size}>>"
  end
}

Whenever we encounter an email address we must first check whether we have generated a replacement string for this address already. If not, we create a new one. No rocket science.
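The snippets in this article assume that email holds the text and ADDR is a regular expression matching addresses; neither is shown here. If you want to play along, a minimal setup might look like the following sketch. The pattern, the sample text and the method name anonymize are placeholders made up for this illustration, and the pattern is deliberately naive.

```ruby
# Hypothetical stand-ins: the article does not show ADDR or the email text.
# The pattern is deliberately naive and only good enough for a demo.
ADDR = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

EMAIL = <<~TEXT
  From: alice@example.com
  To: bob@example.org
  Cc: alice@example.com
TEXT

# Same logic as the Java-style snippet, wrapped in a method.
def anonymize(text)
  subst = {}
  text.gsub(ADDR) do |match|
    if subst.has_key? match
      subst[match]                              # seen before: reuse replacement
    else
      subst[match] = "<<MAIL #{subst.size}>>"   # new address: number it
    end
  end
end

puts anonymize(EMAIL)
# From: <<MAIL 0>>
# To: <<MAIL 1>>
# Cc: <<MAIL 0>>
```

Note that the repeated address gets the same replacement both times, which is the whole point of keeping the Hash around.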

A little more sophisticated

You might come across method Hash#fetch when reading the library documentation. It comes in handy because its block is invoked only if the key is not present in the Hash. The code now looks a little shorter already:

subst = {}

puts email.gsub(ADDR) {|match|
  subst.fetch(match) {|k| subst[k] = "<<MAIL #{subst.size}>>" }
}
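To convince yourself that the fetch block really only runs for missing keys, you can log each invocation. This is just a sketch; the helper replacement_for and the log array are made up for illustration.

```ruby
# Hypothetical helper: returns the replacement for addr and records
# in `log` every key for which the fetch block actually ran.
def replacement_for(subst, addr, log)
  subst.fetch(addr) do |k|
    log << k
    subst[k] = "<<MAIL #{subst.size}>>"
  end
end

subst = {}
log = []
p replacement_for(subst, "alice@example.com", log)  # key missing: block runs   => "<<MAIL 0>>"
p replacement_for(subst, "alice@example.com", log)  # key present: block skipped => "<<MAIL 0>>"
p replacement_for(subst, "bob@example.org", log)    # key missing again          => "<<MAIL 1>>"
p log.size                                          # => 2
```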

O||=erator

A similar thing can be achieved with the ubiquitous operator ||= which allows for conditional assignment. In case you are not yet familiar with it you’ll find plenty of discussions on ruby-talk that revolve around it. The short summary is that a ||= b is equivalent to a || a = b and not a = a || b as you might be tempted to believe.

subst = {}

puts email.gsub(ADDR) {|match|
  subst[match] ||= "<<MAIL #{subst.size}>>"
}

The code has become even shorter. But we are not finished yet!
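The difference between a || a = b and a = a || b is easy to miss because both usually leave the same value behind. It shows up when reading and assigning behave differently, for example with a Hash that has a default value. A small sketch (the method names and the key are arbitrary):

```ruby
# Reading a missing key from a Hash with a default value returns the
# default without storing anything.
def with_or_assign
  h = Hash.new("default")
  h[:k] ||= "new"           # behaves like: h[:k] || h[:k] = "new"
  h
end

def with_plain_assign
  h = Hash.new("default")
  h[:k] = h[:k] || "new"    # always assigns, here the default itself
  h
end

p with_or_assign.size     # => 0, the default is truthy so no assignment happened
p with_plain_assign.size  # => 1
```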

Outsourcing

If you need that replacement in multiple places of your code you’ll likely put it into a method. However, if you want to replace different things (i.e. you need different regular expressions) which can match the same string you might want to outsource the generation of the replacement string so you can use it with different calls of gsub. You can of course do it with an additional method but there is a more elegant way to do it:

subst = Hash.new {|h,k| h[k] = "<<MAIL #{h.size}>>"}

puts email.gsub(ADDR) {|match| subst[match]}

We simply use Hash's default proc functionality for this. This is basically the same thing the fetch block does, but now the code is attached to the Hash instance and not to the #fetch call.
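Here is a sketch of what sharing one such Hash between different gsub calls could look like. The two patterns, the sample strings and the method name anonymize are invented for this example; the only point is that both patterns can hit the same address and still produce the same replacement.

```ruby
# Hypothetical patterns for illustration; both can match the same address.
COM_ADDR   = /[\w.+-]+@[\w-]+\.com\b/
ALICE_ADDR = /alice@[\w-]+(?:\.[\w-]+)+/

# The replacement logic lives in the Hash, not in the gsub blocks.
def anonymize(text, pattern, subst)
  text.gsub(pattern) { |match| subst[match] }
end

subst = Hash.new { |h, k| h[k] = "<<MAIL #{h.size}>>" }

puts anonymize("From: alice@example.com, To: bob@example.com", COM_ADDR, subst)
# From: <<MAIL 0>>, To: <<MAIL 1>>
puts anonymize("Please reply to alice@example.com.", ALICE_ADDR, subst)
# Please reply to <<MAIL 0>>.
```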

You might wonder how much further we can get. And indeed, this solution is probably the most idiomatic one and the one you see most frequently in seasoned Ruby developers’ code. It turns out, though, that we can drive this further if we are prepared to use some newer Ruby features.

Getting tricky

Since Ruby 1.8.7 you can use anything as a block parameter to a method provided it implements a method to_proc which returns a Proc. Namely, class Symbol implements this method in the following way: it returns a proc which needs at least one argument when called; it invokes the method named by the symbol on the first argument and passes any remaining arguments along. This allows for convenient operations like mapping data:

irb(main):001:0> (1..3).map &:to_s
=> ["1", "2", "3"]
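If you want to see the proc behind the ampersand, you can call Symbol#to_proc directly. The following lines are only a sketch with arbitrary values:

```ruby
to_s = :to_s.to_proc
p to_s.call(42)        # same as 42.to_s => "42"

plus = :+.to_proc
p plus.call(1, 2)      # same as 1.+(2), the extra argument is passed on => 3

# The shorthand and the explicit block give the same result.
with_block  = (1..3).map { |x| x.to_s }
with_symbol = (1..3).map(&:to_s)
p with_block == with_symbol   # => true
```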

One thing that bugged me was that the block handed to gsub above does nothing more than forward to the Hash lookup. With the new feature it should be possible to make the code a bit more concise. Luckily there are some core classes that implement to_proc already:

$ ruby -e 'ObjectSpace.each_object(Module) {|m| p m if m.instance_methods.include? "to_proc"}'
Method
Proc
Symbol

$ ruby19 -e 'ObjectSpace.each_object(Module) {|m| p m if m.instance_methods.include? :to_proc}'
Method
Proc
Symbol

We can exploit this fact and now we can write the code like this:

subst = Hash.new {|h,k| h[k] = "<<MAIL #{h.size}>>"}

puts email.gsub(ADDR, &subst.method(:[]))

Note: code changed after arthurschreiber’s comment.
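In case the expression subst.method(:[]) looks cryptic: it returns a Method object bound to that particular Hash, and the ampersand converts it to a block via Method#to_proc. A small sketch with made-up addresses:

```ruby
subst = Hash.new { |h, k| h[k] = "<<MAIL #{h.size}>>" }

lookup = subst.method(:[])     # Method object bound to this Hash instance
p lookup.call("a@example.org")           # same as subst["a@example.org"] => "<<MAIL 0>>"
p lookup.to_proc.call("a@example.org")   # what & does behind the scenes   => "<<MAIL 0>>"

# Any method that takes a block accepts it, not just gsub.
p %w[b@example.org c@example.org b@example.org].map(&lookup)
# => ["<<MAIL 1>>", "<<MAIL 2>>", "<<MAIL 1>>"]
```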

Now, that looks ugly, doesn’t it? We should be able to do something about that, because after all we love Ruby for its elegance and clear syntax. Yes, we can!

Even shorter with a general solution

Since we can use any object, why not provide a general mechanism for this case? Not only Hash but also Array and a lot more classes provide method [] as a general hook for lookup or execution:

$ ruby19 -e 'ObjectSpace.each_object(Module) {|m| p m if m.instance_methods.include? :[]}'
Thread
Method
Proc
Struct::Tms
MatchData
Struct
Hash
Array
Bignum
Fixnum
Symbol
String

Now, let’s allow all these to be simply used as block parameters!

class Object
  def to_proc(m = :[])
    method(m).to_proc
  end
end

subst = Hash.new {|h,k| h[k] = "<<MAIL #{h.size}>>"}

puts email.gsub(ADDR, &subst)

Now we can just pass any Hash instance to gsub. The working logic for calculating our replacement string is now completely restricted to the Hash creation. This is a really elegant solution!
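A side note for readers on newer Rubies: Hash has had its own to_proc since Ruby 2.3, so for the Hash case the same call should work there without touching Object at all. The following sketch relies on that; ADDR, the sample text and the method name anonymize are placeholders made up for the example.

```ruby
# No Object#to_proc patch here; this relies on Hash#to_proc (Ruby 2.3 or later).
# ADDR is a deliberately naive, hypothetical pattern.
ADDR = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

def anonymize(text)
  subst = Hash.new { |h, k| h[k] = "<<MAIL #{h.size}>>" }
  text.gsub(ADDR, &subst)   # the Hash itself acts as the block
end

puts anonymize("alice@example.com wrote to bob@example.org and alice@example.com")
# <<MAIL 0>> wrote to <<MAIL 1>> and <<MAIL 0>>
```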

Golf

We can reduce the number of characters to type a bit more by throwing out the variable declaration, effectively turning this into a one-liner:

puts email.gsub(ADDR, &Hash.new {|h,k| h[k] = "<<MAIL #{h.size}>>"})

I don’t think this is an improvement over the last variant, but sometimes it helps to drive things as far as possible to find out where in the process we reached the optimum.

The fun begins

Some classes also implement [] as a class method; we should be able to make good use of that as well. We might be tempted to create a lot of Struct instances via this method. It can be done, but we have to do some tweaking because Struct.[] does not splat a single Array argument, so we have to redefine it a bit:

Name = Struct.new :forename, :surname

# unfortunately this does not work with the default Struct.[]
def Name.[](a)
  new(*a)
end

p [
  ["John", "Doe"],
  ["John", "Cleese"],
  ["Mickey", "Mouse"],
].map(&Name)

# maybe a bit better:
def Name.create(a)
  new(*a)
end

p [
  ["John", "Doe"],
  ["John", "Cleese"],
  ["Mickey", "Mouse"],
].map(&Name.to_proc(:create))
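To see why the default Struct.[] needs the tweak, it helps to look at what it does with a single Array argument. This is only a sketch; Pair is a made-up struct that mirrors Name so that nothing above gets redefined.

```ruby
Pair = Struct.new(:forename, :surname)   # hypothetical, same shape as Name

unsplatted = Pair[["John", "Doe"]]   # the Array is taken as the first member
p unsplatted.forename                # => ["John", "Doe"]
p unsplatted.surname                 # => nil

splatted = Pair[*["John", "Doe"]]    # splatting spreads it over the members
p splatted.forename                  # => "John"
p splatted.surname                   # => "Doe"
```

Since map hands each inner Array to the block as one argument, the unsplatted case is what we would get without redefining [] or adding create.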

I hope you had some fun reading this and, more importantly, playing around yourself. Trying things out will certainly help you discover new ways and improve your skills.

As usual, I have placed the code on GitHub. If you look at it, please don’t get hung up on the regular expression for matching email addresses. This is a whole topic of its own and I just hacked something together to make the code work.
