
--pipe is well known for being slow.

Try --pipe-part instead:

    parallel -a ngrams.tsv --pipe-part --block -1 awk -f map.awk |
      awk -f reduce.awk
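
The reason the input can be split between jobs at all is that the reduce step only needs the concatenated map output, in any chunking. A minimal sketch of that idea, runnable without parallel: map_step and reduce_step are hypothetical stand-ins for map.awk and reduce.awk (the real scripts are not shown in this thread), and the three-column "ngram, year, count" input layout is an assumption.

```shell
#!/bin/sh
# Hypothetical stand-in for map.awk: emit "year<TAB>count" for each input line.
# Assumed input layout: ngram<TAB>year<TAB>count
map_step() {
  awk -F'\t' '{ print $2 "\t" $3 }'
}

# Hypothetical stand-in for reduce.awk: track the largest key and the total count.
reduce_step() {
  awk -F'\t' '
    NR == 1 || $1 + 0 > max_key { max_key = $1 + 0 }
    { sum += $2 }
    END { print "max_key: " max_key " sum: " sum }'
}

# Tiny made-up sample, not real ngram data.
sample_file=$(mktemp)
printf 'a\t2005\t10\nb\t2006\t5\nc\t2004\t7\n' > "$sample_file"

# One mapper over the whole file.
whole=$(map_step < "$sample_file" | reduce_step)

# Two mappers over separate chunks, outputs concatenated into one reducer.
chunked=$( { head -n 1 "$sample_file" | map_step
             tail -n 2 "$sample_file" | map_step; } | reduce_step )

rm -f "$sample_file"

echo "whole:   $whole"
echo "chunked: $chunked"
```

Both lines should print the same result, since max and sum do not depend on how the records were grouped. That is the property that lets the mappers each work on their own part of the file while a single reducer consumes the combined output.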


Thanks for the tip, always nice to see the author of tools commenting on hn :-)

The (old) version of parallel packaged with Ubuntu 16.04 (Windows Subsystem for Linux) doesn't have --pipe-part, but running the upstream version, the speed is more reasonable:

  $ time (./parallel-20170522/src/parallel -a ngrams.tsv \
    --pipe-part --block -1 -j4 mawk -f map.awk \
    | mawk -f reduce.awk )
  max_key: 2006 sum: 22569013

  real    0m2.265s
  user    0m4.672s
  sys     0m1.672s
(Tried a few variants with/without -jN -- and this seems typical for the fast end of the spectrum).

  $ time (cat ngrams.tsv \
     | mawk -f map.awk \
     | mawk -f reduce.awk )
  max_key: 2006 sum: 22569013

  real    0m3.472s
  user    0m2.891s
  sys     0m2.406s
[ed: btw, did a double-take when I saw your Gnu Privacy Guard id: 0x88888888 :-) ]


--line-buffer may or may not give an additional speedup.



