AN EXAMPLE OF MULTICORE PROGRAMMING WITH FORKANDRETURN

We’ve got 42 .gz files. Compressed, that’s 44,093,076 bytes. The question is: how many bytes do we get if we decompress them?

We could do this, if we’re on a Unix machine:

$ time zcat *.gz | wc -c
860187300

real    0m7.266s
user    0m5.810s
sys     0m3.341s

Can we do it without the Unix tools? With pure Ruby? Sure, we can:

$ cat count.rb
require "zlib"

count   = 0

Dir.glob("*.gz").sort.each do |file|
  Zlib::GzipReader.open(file) do |io|
    while block = io.read(4096)
      count += block.size
    end
  end
end

puts count

Which indeed returns the correct answer:

$ time ruby count.rb
860187300

real    0m5.687s
user    0m5.499s
sys     0m0.186s

But can we take advantage of both CPUs? Yes, we can. The plan is to use ForkAndReturn’s Enumerable#concurrent_collect instead of Enumerable#each. But let’s reconsider our code first. First question: what’s the most expensive part of the code? Well, even without profiling, we can say that the iteration over the blocks and the inflating of the compressed archives are the most expensive parts. Can we run these blocks of code concurrently on several CPUs? Well, in fact, we can’t, not yet. In the while loop, we update a counter that is shared by all iterations. That’s not going to work if we fork into separate processes: each child gets its own copy of the variable, and the parent never sees the updates (a tiny experiment further down illustrates this). So, we first make each iteration count its own bytes and sum the per-file counts afterwards:

$ cat count.rb
require "zlib"

count   = 0

Dir.glob("*.gz").sort.collect do |file|
  c     = 0

  Zlib::GzipReader.open(file) do |io|
    while block = io.read(4096)
      c += block.size
    end
  end

  c
end.each do |c|
  count += c
end

puts count

Which runs as fast as the previous version:

$ time ruby count.rb
860187300

real    0m5.703s
user    0m5.515s
sys     0m0.218s
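
(If it isn’t obvious why the original shared counter would break under forking, here’s a tiny experiment. It isn’t part of count.rb, just an illustration: a forked child works on a copy of the parent’s memory, so its updates never reach the parent.)

count = 0

pid = fork do
  count += 1000   # modifies only the child's copy of count
end
Process.wait(pid)

puts count   # => 0; the parent never sees the child's update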

We can now run the local counts concurrently, by changing only one word (and requiring the library):

$ cat count.rb
require "zlib"
require "forkandreturn"

count   = 0

Dir.glob("*.gz").sort.concurrent_collect do |file|
  c     = 0

  Zlib::GzipReader.open(file) do |io|
    while block = io.read(4096)
      c += block.size
    end
  end

  c
end.each do |c|
  count += c
end

puts count

$ time ruby count.rb
860187300

real    0m3.860s
user    0m6.511s
sys     0m1.120s

Yep, it runs faster! About 1.5x as fast (from 5.7 to 3.9 seconds). Not quite doubling the speed, but it’s close enough…
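
How does this work under the hood? Roughly like this: for each job, a child process is forked, the block runs in that child, and the result travels back to the parent. The sketch below is not ForkAndReturn’s actual implementation; it’s just a minimal version of the same technique, built on fork, IO.pipe and Marshal, without the job limit or any error handling (and the method name naive_concurrent_collect is made up for this example):

module Enumerable
  # Fork one child per element, run the block there, and collect the
  # Marshal'ed results in the parent. For illustration only.
  def naive_concurrent_collect
    collect do |item|
      reader, writer = IO.pipe
      pid = fork do
        reader.close
        writer.write(Marshal.dump(yield(item)))   # ship the block's result to the parent
        writer.close
      end
      writer.close
      [pid, reader]
    end.collect do |pid, reader|
      result = Marshal.load(reader.read)   # blocks until the child has finished writing
      reader.close
      Process.wait(pid)
      result
    end
  end
end

With that in place, Dir.glob("*.gz").sort.naive_concurrent_collect { |file| ... } behaves much like the real thing, minus the job limit, the exception handling and the other niceties that ForkAndReturn adds.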

But since all files are parsed concurrently, aren’t we in danger of exhausting our memory? And what about parsing thousands of files? Can’t we run just a couple of jobs concurrently, instead of all of them at once? Sure. Just add one parameter: the maximum number of concurrent jobs:

$ cat count.rb
require "zlib"
require "forkandreturn"

count   = 0

Dir.glob("*.gz").sort.concurrent_collect(4) do |file|
  c     = 0

  Zlib::GzipReader.open(file) do |io|
    while block = io.read(4096)
      c += block.size
    end
  end

  c
end.each do |c|
  count += c
end

puts count

Et voilà:

$ time ruby count.rb
860187300

real    0m3.953s
user    0m6.436s
sys     0m1.309s

A bit of extra overhead compared to running all jobs at once, but friendlier to the other applications running on the machine.
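
By the way, a reasonable value for that parameter is the number of CPU cores in the machine. If you don’t want to hard-code it, something like this could work on Linux (cpu_count is a made-up helper for this example, not part of ForkAndReturn):

def cpu_count
  File.readlines("/proc/cpuinfo").grep(/^processor/).size   # one line per core on Linux
rescue
  2   # fall back to a sensible guess
end

Dir.glob("*.gz").sort.concurrent_collect(cpu_count) do |file|
  # ... same per-file counting as above ...
end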

(BTW, the answer is 860,187,300 bytes.)