Faster command line tools with Haskell

Posted on July 30, 2017

Retraction/Update : The Haskell program at the end of this article was not faster. Friends don’t let friends benchmark after dark. I seem to have made a benchmarking mistake and measured the total time of the benchmark in the Go program vs one iteration of the Haskell program. I was in a hurry for this entire post and rushed through a few pieces of it due to time constraints. If I’d had more time, I would have written a Criterion benchmark for both the Go and Haskell program. In the future I think I’ll do so to reduce the possibility of mistakes like this.

Inspired by Faster command line tools with Go which itself was inspired by Faster Command Line Tools in D.

The original problem statement:

It’s a common programming task: Take a data file with fields separated by a delimiter (comma, tab, etc), and run a mathematical calculation involving several of the fields. Often these programs are one-time use scripts, other times they have longer shelf life. Speed is of course appreciated when the program is used more than a few times on large files.

The specific exercise we’ll explore starts with files having keys in one field, integer values in another. … Fields are delimited by a TAB, and there may be any number of fields on a line. The file name and field numbers of the key and value are passed as command line arguments.

What I wish to be true:

I write high level Haskell code as declaratively and naievely as possible and sufficiently smart compiler optimizations make it as fast as hand tuned Go code.

Let’s see if it works out that way!

We’ll use the same file described in Faster command line tools with Go :

The data we will use is a file of n-grams from the Google Books dataset. The file is 184 Megabytes, uncompressed, and has a total of 10,512,769 lines.

I’ll also track progress in a Github Repo so others can easily reproduce my tests.

First, lets set a baseline by getting the runtime for the fastest Go program from that post. I’m using Go 1.6 which is missing quite a few performance improvements, so I’ll update to Go 1.8 using the ppa:gophers/archive ppa.

All done, now the fastest version of the Golang code that operates on bytes directly can be found at the bottom of , but it’s short enough to paste inline here (note I made small modifications because it wouldn’t build directly from the blog post):

func processLine(b []byte) (int, int) {
	key := 0
	val := 0
	i := 0
	for b[i] != '\t' {
		i++
	}
	for i++; b[i] != '\t'; i++ {
		key = key*10 + int(b[i]) - '0'
	}
	for i++; b[i] != '\t'; i++ {
		val = val*10 + int(b[i]) - '0'
	}
	return key, val
}

func processFile(file *os.File) (int, int) {
	file.Seek(0, 0)
	var sumByKey [2009]int

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Bytes()
		k1, v1 := processLine(line)
		sumByKey[k1] += v1
	}
	var k int
	var v int
	for i, val := range sumByKey {
		if val > v {
			k = i
			v = val
		}
	}
	return k, v

}

Benchmark:

package main

import (
	"os"
	"testing"
)

func Benchmark_processFile(b *testing.B) {
	file, err := os.Open("../ngrams.tsv")
	defer file.Close()
	if err != nil {
		b.Fatal(err)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		k, v := processFile(file)
		if k != 2006 || v != 22569013 {
			b.Fatalf(`bad result %v | %v`, k, v)
		}
	}
}

And it’s runtime:

cody@zentop:~/faster-command-line-tools-with-haskell/go$ go test -v -bench=.
Benchmark_processFile-4                5         252309722 ns/op
PASS
ok      _/home/cody/faster-command-line-tools-with-haskell/go   2.274s
cody@zentop:~/faster-command-line-tools-with-haskell/go$ go test -v -bench=.
Benchmark_processFile-4                5         252163852 ns/op
PASS
ok      _/home/cody/faster-command-line-tools-with-haskell/go   2.274s
cody@zentop:~/faster-command-line-tools-with-haskell/go$ go test -v -bench=.
Benchmark_processFile-4                5         252327565 ns/op
PASS
ok      _/home/cody/faster-command-line-tools-with-haskell/go   2.274s

Now, I want to create a Haskell version and compare it’s performance and if necessary, the process of profiling performance!

Navigating back…

cd /home/cody/faster-command-line-tools-with-haskell

Creating a new Haskell project using LTS Haskell 9.0 (ghc-8.0.2):

stack new haskell-version simple --resolver lts-9.0

We’ll use the bytestring package by adding it to haskell-version.cabal:

name:                haskell-version
version:             0.1.0.0
... snip ...
build-depends:       base >= 4.7 && < 5
                     , bytestring >= 0.10.8.11

Now instead of porting the most performant version, I want to focus on the high level problem and write the simplest and most declarative solution similar to how Faster command line tools with Go’s author did.

My hypothesis (and hope!) is that the high level Haskell version will be competitive with the hand-optimized Go version.

Their first version was:

func processFile(filePath string, keyIndex, valueIndex int) {
	delim := "\t"

	fileHandle, err := os.Open(filePath)
	defer fileHandle.Close()

	maxFieldIndex := int(math.Max(float64(keyIndex), float64(valueIndex)))
	sumByKey := make(map[string]int)

	if err != nil {
		log.Fatal(err)
	}

	fileReader := bufio.NewScanner(fileHandle)

	for fileReader.Scan() {
		fields := strings.Split(strings.TrimSpace(fileReader.Text()), delim)

		if maxFieldIndex < len(fields) {
			value, err := strconv.Atoi(fields[valueIndex])
			if err != nil {
				log.Fatal(err)
			}
			sumByKey[fields[keyIndex]] += value
		}
	}

	maxValue := 0
	maxKey := ""
	for k, v := range sumByKey {
		if v > maxValue {
			maxValue = v
			maxKey = k
		}
	}
	fmt.Println("max_key:", maxKey, "sum:", sumByKey[maxKey])

}

We’ll also use a strict (I usually use strict unless I have good reason not to) Map in the Haskell version, meaning we’ll want to add containers to our cabal file:

name:                haskell-version
version:             0.1.0.0
... snip ...
build-depends:       base >= 4.7 && < 5
                   , bytestring >= 0.10.8.11
                   , containers >= 0.5.7.1

A first Haskell version looks like:

module Main where

import qualified Data.ByteString.Char8 as C8
import qualified Data.Map.Strict as M
import Data.List

toInt :: C8.ByteString -> Int
toInt = read . C8.unpack

processFile = do
  bsLines <- C8.lines <$> C8.readFile "../ngrams.tsv"
  let info = C8.splitWith (=='\t') <$> bsLines :: [[C8.ByteString]]
  let m = foldl'
             (\acc x -> let (_:key:val:_) = (x :: [C8.ByteString]) in M.insertWith (\new old -> old + new) key (toInt val) acc)
             M.empty
             info
  pure $ last $ sortOn snd (M.toList m)

main :: IO ()
main = do
  x <- processFile
  print x

Let’s run it:

cody@zentop:~/faster-command-line-tools-with-haskell/haskell-version$ stack build
cody@zentop:~/faster-command-line-tools-with-haskell/haskell-version$ time stack exec -- haskell-version
("2006",22569013)

real    0m10.960s
user    0m10.828s
sys     0m0.140s

Alright, we are 5x slower than the hand optimized Go code. Let’s see why. NOTE that runtime for profiled builds will be much slower as in any language.

cody@zentop:~/faster-command-line-tools-with-haskell/haskell-version$ stack clean && stack build --profile && stack exec -- haskell-ver
sion -p                                                                                                                               
haskell-version-0.1.0.0: configure (exe)
Configuring haskell-version-0.1.0.0...
haskell-version-0.1.0.0: build (exe)
Preprocessing executable 'haskell-version' for haskell-version-0.1.0.0...
[1 of 1] Compiling Main             ( src/Main.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version-t
mp/Main.p_o )                                                                                                                         
Linking .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version ...
haskell-version-0.1.0.0: copy/register
Installing executable(s) in
/home/cody/faster-command-line-tools-with-haskell/haskell-version/.stack-work/install/x86_64-linux/lts-9.0/8.0.2/bin
("2006",22569013)
cody@zentop:~/faster-command-line-tools-with-haskell/haskell-version$ cat haskell-version.prof 
        Sun Jul 30 22:51 2017 Time and Allocation Profiling Report  (Final)

           haskell-version +RTS -p -RTS

        total time  =       20.74 secs   (20742 ticks @ 1000 us, 1 processor)
        total alloc = 49,046,382,928 bytes  (excludes profiling overheads)

COST CENTRE      MODULE    SRC                         %time %alloc

toInt            Main      src/Main.hs:8:1-24           56.5   54.5
processFile.info Main      src/Main.hs:12:7-67          22.0   26.3
processFile.m.\  Main      src/Main.hs:14:25-126        14.7   14.7
processFile      Main      src/Main.hs:(10,1)-(17,39)    6.2    4.2


                                                                                               individual      inherited
COST CENTRE                MODULE                SRC                        no.     entries  %time %alloc   %time %alloc

MAIN                       MAIN                  <built-in>                  63          0    0.0    0.0   100.0  100.0
 CAF                       GHC.IO.Handle.FD      <entire-module>            108          0    0.0    0.0     0.0    0.0
 CAF                       GHC.IO.Handle.Text    <entire-module>            106          0    0.0    0.0     0.0    0.0
 CAF                       GHC.IO.Encoding       <entire-module>            100          0    0.0    0.0     0.0    0.0
 CAF                       GHC.IO.FD             <entire-module>             99          0    0.0    0.0     0.0    0.0
 CAF                       Text.Read.Lex         <entire-module>             94          0    0.0    0.0     0.0    0.0
 CAF                       GHC.Conc.Signal       <entire-module>             92          0    0.0    0.0     0.0    0.0
 CAF                       GHC.IO.Encoding.Iconv <entire-module>             84          0    0.0    0.0     0.0    0.0
 CAF:main1                 Main                  <no location info>         124          0    0.0    0.0     0.0    0.0
  main                     Main                  src/Main.hs:(20,1)-(22,9)  126          1    0.0    0.0     0.0    0.0
 CAF:main2                 Main                  <no location info>         122          0    0.0    0.0     0.0    0.0
  processFile              Main                  src/Main.hs:(10,1)-(17,39) 128          1    0.0    0.0     0.0    0.0
 CAF:main7                 Main                  <no location info>         117          0    0.0    0.0     0.0    0.0
  processFile              Main                  src/Main.hs:(10,1)-(17,39) 130          0    0.0    0.0     0.0    0.0
 CAF:toInt                 Main                  src/Main.hs:8:1-5          121          0    0.0    0.0     0.0    0.0
  toInt                    Main                  src/Main.hs:8:1-24         136          1    0.0    0.0     0.0    0.0
 CAF:toInt3                Main                  <no location info>         120          0    0.0    0.0     0.0    0.0
  toInt                    Main                  src/Main.hs:8:1-24         138          0    0.0    0.0     0.0    0.0
 main                      Main                  src/Main.hs:(20,1)-(22,9)  127          0    0.0    0.0   100.0  100.0
  processFile              Main                  src/Main.hs:(10,1)-(17,39) 129          0    6.2    4.2   100.0  100.0
   processFile.info        Main                  src/Main.hs:12:7-67        131          1   22.0   26.3    22.0   26.3
   processFile.m           Main                  src/Main.hs:(13,7)-(16,17) 132          1    0.4    0.0    71.9   69.5
    processFile.m.\        Main                  src/Main.hs:14:25-126      133   10512769   14.7   14.7    71.4   69.5
     processFile.m.\.(...) Main                  src/Main.hs:14:29-66       134   10512769    0.0    0.0     0.0    0.0
     processFile.m.\.key   Main                  src/Main.hs:14:29-66       135   10512769    0.0    0.0     0.0    0.0
     processFile.m.\.val   Main                  src/Main.hs:14:29-66       139   10512769    0.0    0.0     0.0    0.0
     processFile.m.\.\     Main                  src/Main.hs:14:97-105      140   10512355    0.3    0.3     0.3    0.3
     toInt                 Main                  src/Main.hs:8:1-24         137          0   56.5   54.5    56.5   54.5

Alright, it looks like most of the time is spent constructing info which is our map of years and sums. However I also see that 56.5% of time is spent in the toInt function. Let’s revisit it:

toInt :: C8.ByteString -> Int
toInt = read . C8.unpack

There is a recent bytestring-conversions package by Toralf Wittner which does this using typeclasses and an older library by Don Stewart called bytestring-lexing. I haven’t used either, so first I’ll try out bytestring-lexing since Don has created many simple and performant libraries I’ve used in the past.

Our cabal file should now look something like this:

name:                haskell-version
version:             0.1.0.0
... snip ...
build-depends:       base >= 4.7 && < 5
                   , bytestring >= 0.10.8.11
                   , bytestring-lexing
                   , containers >= 0.5.7.1

Then the code change is as follows:

+ import Data.ByteString.Lex.Integral

toInt :: C8.ByteString -> Int
- toInt = read . C8.unpack
+ toInt = readDecimal_

Now we run again (be sure to include stack clean just in case, using stack 1.4 I had issues with –profile build being “sticky”):

stack clean && stack build && time stack exec -- haskell-version
haskell-version-0.1.0.0: configure (exe)
Configuring haskell-version-0.1.0.0...
haskell-version-0.1.0.0: build (exe)
Preprocessing executable 'haskell-version' for haskell-version-0.1.0.0...
[1 of 1] Compiling Main             ( src/Main.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version-tmp/Main.o )
Linking .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version ...
haskell-version-0.1.0.0: copy/register
Installing executable(s) in
/home/cody/faster-command-line-tools-with-haskell/haskell-version/.stack-work/install/x86_64-linux/lts-9.0/8.0.2/bin
("2006",22569013)
3.491 secs

WOW! We are down to 3.491 seconds with one simple(?) change from the naieve version.

Many might be asking: Once you identified toInt as a bottleneck, how did you find bytestring-lexing or bytestring-conversions? Did you know about them in advance?

I actually did not know about them in advance. I searched “haskell bytestring to int” on Google. I was presented with the results below. Note that I’ve annotated with [] my choice in picking or dismissing each of these search results.

Annotated Google Results:

Data.ByteString.Conversion - Hackage 
bytestring-conversion-0.2: Type-classes to convert values to and from ByteString. Safe Haskell, None. Language, Haskell2010 ... Parse ByteString s. Methods.
>>>> PERFECT MATCH! I do wonder if there are other libraries though. Toralf wittner has that great and fast cql-io library I've used before. Maybe I can stop here. Ah well, I'll look for more.

Data.ByteString.Conversion.To - Hackage
bytestring-conversion-0.2: Type-classes to convert values to and from ByteString. Safe Haskell, None. Language ... ToByteString Int · ToByteString Int8.
>>>> PASS, this is just a subpage of the library I've made a mental note of above

haskell - What is the best way to convert a ByteString to an Int? - Stack ...
Jan 16, 2012 - Show is used to create a String representation of something, that is useful for debugging and plain-text serialization. The Show typeclass is not ...
>>>> PASS (sadly none of the answers helped me much) Oh good, a stack overflow answer where an expert might have answered

How to convert a Integer to a ByteString in Haskell - Stack Overflow
Feb 17, 2010 - to yield lazy bytestrings, which can of course be converted to strict ones. The cereal package provides much the same interface, but yields strict bytestrings only (so no infinite streaming of encodings). Have a look at the binary package, or any of its non-lazy variants: cereal or binary-strict .
>>>> PASS, but I'm not totally sure why I passed it up. I think because it was mentioning binary a lot, which is related to the problem of turning some bytes into an int rather than a bytestring literal into an int

haskell - How to convert a ByteString to an Int and dealing with ...
Jan 15, 2013 - I need to read a binary format in Haskell. The format is fairly simple: ... The binary package contains tools to get integer types of various sizes and ...
>>>> PASS.. This one lured me in, but ended up being more about binary

haskell - How do I convert 4 bytes of a ByteString to an Int? - Stack ...
Oct 23, 2014 - How can I convert a ByteString to an Int? I have the following 4 bytes ... For binary parsing, have a look at the binary package ... I managed to get ...
>>>> PASS... more binary

Haskell: "Reading" ByteString - Stack Overflow
Mar 23, 2012 - You can use readInt or readInteger from Data.ByteString.Char8 . If you want to read some other type of data, you'll need to write your own ...
>>>> PASS. This one looked promising, but the one below it caught my eye before I clicked through

performance - Efficient number reading in Haskell - Stack Overflow
Dec 20, 2010 - The only time I encountered parsing doubles on the critical path, I used this: ... Update: OK, seems that readDouble is in package bytestring-lexer which is not installed by default. Any other idea? performance parsing haskell ...
>>>> WINNER..

When you get to that final search results stack overflow page you’ll see this answer:

The only time I encountered parsing doubles on the critical path, I used this:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import qualified Data.ByteString.Char8 as B
    import Foreign.C.Types
    import Foreign.C.String
    import System.IO.Unsafe
    
    foreign import ccall unsafe "stdlib.h atof" c_atof :: CString -> IO CDouble
    unsafeReadDouble = unsafePerformIO . flip B.useAsCString c_atof

There wasn't anything that looked like a readDouble in bytestring at that time, though. That would probably be a better solution if it's now standard.

- JB.

I was not satisfied with this answer, though I was willing to use it if necessary. Then I scrolled down, glanced at the name, and was happy to see Don Stewart’s name since he is a famous Haskeller that highly values simplicity and performance. Author of such blog posts as Write Haskell as fast as C: exploiting strictness, laziness and recursion, Engineering Large Projects in a Functional Language, and author of an automated gc tuner for GHC announced in the blog post ghc-gc-tune: Tuning Haskell GC settings for fun and profit.

Don’s answer was as follows:

Another solution: install the bytestring-lexing package, and use readDouble, which I optimized for you.

  cabal install bytestring-lexing

The package provides optimized parsing functions for floating point literals:

  readDouble :: ByteString -> Maybe (Double, ByteString)

This is what led me to make the simple fix that made our program roughly 3x faster. Is finding the correct library and making that small change simple? I believe so. Now next time I need to do this task I’ll know what library I need.

However, Go had a fast solution in the standard library: atoi

Ideally I think, Haskell would have some fast equivalent of atoi in the standard library that works with Text and ByteString. However that also brings up the discussion about Haskell ideally using Text as a default (which I’ll note is becoming more possible as Backpack progresses)

Now, let’s look at the profile for the faster version:

~/f/haskell-version:master*? λ stack clean && stack build --profile && stack exec -- haskell-version +RTS -p
bytestring-lexing-0.5.0.2: configure
bytestring-lexing-0.5.0.2: build
bytestring-lexing-0.5.0.2: copy/register
haskell-version-0.1.0.0: configure (exe)
Configuring haskell-version-0.1.0.0...
haskell-version-0.1.0.0: build (exe)
Preprocessing executable 'haskell-version' for haskell-version-0.1.0.0...
[1 of 1] Compiling Main             ( src/Main.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version-tmp/Main.p_o )
Linking .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version ...
haskell-version-0.1.0.0: copy/register
Installing executable(s) in
/home/cody/faster-command-line-tools-with-haskell/haskell-version/.stack-work/install/x86_64-linux/lts-9.0/8.0.2/bin
Completed 2 action(s).
("2006",22569013)
~/f/haskell-version:master*? λ cat haskell-version.prof 
	Sun Jul 30 23:42 2017 Time and Allocation Profiling Report  (Final)

	   haskell-version +RTS -p -RTS

	total time  =        7.92 secs   (7924 ticks @ 1000 us, 1 processor)
	total alloc = 22,528,821,312 bytes  (excludes profiling overheads)

COST CENTRE        MODULE                       SRC                                                   %time %alloc

processFile.info   Main                         src/Main.hs:13:7-67                                    51.5   57.3
processFile.m.\    Main                         src/Main.hs:15:25-126                                  31.9   32.0
processFile        Main                         src/Main.hs:(11,1)-(18,39)                             11.6    9.1
readDecimal_.start Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(260,5)-(265,32)    2.3    0.7
processFile.m.\.\  Main                         src/Main.hs:15:97-105                                   1.1    0.7


                                                                                                                                      individual      inherited
COST CENTRE                      MODULE                       SRC                                                  no.     entries  %time %alloc   %time %alloc

MAIN                             MAIN                         <built-in>                                           156          0    0.0    0.0   100.0  100.0
 CAF                             GHC.IO.Handle.FD             <entire-module>                                      198          0    0.0    0.0     0.0    0.0
 CAF                             GHC.IO.Handle.Text           <entire-module>                                      196          0    0.0    0.0     0.0    0.0
 CAF                             GHC.IO.Encoding              <entire-module>                                      190          0    0.0    0.0     0.0    0.0
 CAF                             GHC.IO.FD                    <entire-module>                                      189          0    0.0    0.0     0.0    0.0
 CAF                             GHC.Conc.Signal              <entire-module>                                      184          0    0.0    0.0     0.0    0.0
 CAF                             GHC.IO.Encoding.Iconv        <entire-module>                                      176          0    0.0    0.0     0.0    0.0
 CAF:main1                       Main                         <no location info>                                   310          0    0.0    0.0     0.0    0.0
  main                           Main                         src/Main.hs:(21,1)-(23,9)                            312          1    0.0    0.0     0.0    0.0
 CAF:main2                       Main                         <no location info>                                   308          0    0.0    0.0     0.0    0.0
  processFile                    Main                         src/Main.hs:(11,1)-(18,39)                           314          1    0.0    0.0     0.0    0.0
 CAF:main7                       Main                         <no location info>                                   306          0    0.0    0.0     0.0    0.0
  processFile                    Main                         src/Main.hs:(11,1)-(18,39)                           316          0    0.0    0.0     0.0    0.0
 CAF:readDecimal__$sreadDecimal_ Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:246:1-12         280          0    0.0    0.0     0.0    0.0
  readDecimal_                   Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(246,1)-(335,59) 323          1    0.0    0.0     0.0    0.0
 CAF:toInt                       Main                         src/Main.hs:9:1-5                                    307          0    0.0    0.0     0.0    0.0
  toInt                          Main                         src/Main.hs:9:1-20                                   322          1    0.0    0.0     0.0    0.0
 main                            Main                         src/Main.hs:(21,1)-(23,9)                            313          0    0.0    0.0   100.0  100.0
  processFile                    Main                         src/Main.hs:(11,1)-(18,39)                           315          0   11.6    9.1   100.0  100.0
   processFile.info              Main                         src/Main.hs:13:7-67                                  317          1   51.5   57.3    51.5   57.3
   processFile.m                 Main                         src/Main.hs:(14,7)-(17,17)                           318          1    0.9    0.0    36.9   33.7
    processFile.m.\              Main                         src/Main.hs:15:25-126                                319   10512769   31.9   32.0    36.0   33.7
     processFile.m.\.(...)       Main                         src/Main.hs:15:29-66                                 320   10512769    0.0    0.0     0.0    0.0
     processFile.m.\.key         Main                         src/Main.hs:15:29-66                                 321   10512769    0.0    0.0     0.0    0.0
     processFile.m.\.val         Main                         src/Main.hs:15:29-66                                 326   10512769    0.0    0.0     0.0    0.0
     processFile.m.\.\           Main                         src/Main.hs:15:97-105                                329   10512355    1.1    0.7     1.1    0.7
     readDecimal_                Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(246,1)-(335,59) 324          0    0.6    0.0     3.1    0.9
      readDecimal_.start         Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(260,5)-(265,32) 325   10512769    2.3    0.7     2.5    0.9
       readDecimal_.loop0        Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(268,5)-(274,32) 327   10512769    0.2    0.0     0.2    0.2
        readDecimal_.loop1       Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(278,5)-(284,52) 328    2272392    0.1    0.1     0.1    0.2
         readDecimal_.loop2      Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(285,5)-(291,53) 330     327077    0.0    0.0     0.0    0.0
          readDecimal_.loop3     Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(292,5)-(298,54) 331      43808    0.0    0.0     0.0    0.0
           readDecimal_.loop4    Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(299,5)-(305,55) 332       6556    0.0    0.0     0.0    0.0
            readDecimal_.loop5   Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(306,5)-(312,56) 333        724    0.0    0.0     0.0    0.0
             readDecimal_.loop6  Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(313,5)-(319,57) 334         71    0.0    0.0     0.0    0.0

The majority of the time is spent in building up the map. Is there anyway we can speed that up? From previous knowledge, I know that there is a Data.IntMap.Strict module and I’m almost sure that inserting ints would be faster. We can trade building more ints with the readDecimal_ function from Data.ByteString.Lex.Integral in exchange for using Data.IntMap.Strict. I think it will be a good tradeoff. Lets see.

Here is the full code with the change to Data.IntMap.strict. I had to change on import and use toInt on the key in the first argument of foldl’:

module Main where

import qualified Data.ByteString.Char8 as C8
import qualified Data.IntMap.Strict as M
import Data.List
import Data.ByteString.Lex.Integral

toInt :: C8.ByteString -> Int
toInt = readDecimal_

processFile = do
  bsLines <- C8.lines <$> C8.readFile "../ngrams.tsv"
  let info = C8.splitWith (=='\t') <$> bsLines :: [[C8.ByteString]]
  let m = foldl'
             (\acc x -> let (_:key:val:_) = (x :: [C8.ByteString]) in M.insertWith (\new old -> old + new) (toInt key) (toInt val) acc)
             M.empty
             info
  pure $ last $ sortOn snd (M.toList m)

main :: IO ()
main = do
  x <- processFile
  print x

Let’s see what the results are:

cody@zentop:~/faster-command-line-tools-with-haskell/haskell-version$ stack clean && stack build && stack exec -- time haskell-version
haskell-version-0.1.0.0: configure (exe)
Configuring haskell-version-0.1.0.0...
haskell-version-0.1.0.0: build (exe)
Preprocessing executable 'haskell-version' for haskell-version-0.1.0.0...
[1 of 1] Compiling Main             ( src/Main.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version-t
mp/Main.o )                                                                                                                           
Linking .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version ...
haskell-version-0.1.0.0: copy/register
Installing executable(s) in
/home/cody/faster-command-line-tools-with-haskell/haskell-version/.stack-work/install/x86_64-linux/lts-9.0/8.0.2/bin
(2006,22569013)
2.59user 0.06system 0:02.65elapsed 99%CPU (0avgtext+0avgdata 240216maxresident)k
0inputs+0outputs (0major+12897minor)pagefaults 0swaps

2.59 seconds! Wow, we are very close to the optimized Go program without too much arcane knowledge using Haskell :)

Let’s look at the profile for this one!

cody@zentop:~/faster-command-line-tools-with-haskell/haskell-version$ stack clean && stack build --profile && stack exec -- haskell-version +RTS -p                                                                                                                          
haskell-version-0.1.0.0: configure (exe)
Configuring haskell-version-0.1.0.0...
haskell-version-0.1.0.0: build (exe)
Preprocessing executable 'haskell-version' for haskell-version-0.1.0.0...
[1 of 1] Compiling Main             ( src/Main.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version-tmp/Main.p_o )                                                                                                                         
Linking .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version ...
haskell-version-0.1.0.0: copy/register
Installing executable(s) in
/home/cody/faster-command-line-tools-with-haskell/haskell-version/.stack-work/install/x86_64-linux/lts-9.0/8.0.2/bin
(2006,22569013)
cody@zentop:~/faster-command-line-tools-with-haskell/haskell-version$ cat haskell-version.prof 
        Sun Jul 30 23:52 2017 Time and Allocation Profiling Report  (Final)

           haskell-version +RTS -p -RTS

        total time  =        7.59 secs   (7595 ticks @ 1000 us, 1 processor)
        total alloc = 21,369,235,392 bytes  (excludes profiling overheads)

COST CENTRE        MODULE                       SRC                                                   %time %alloc

processFile.info   Main                         src/Main.hs:13:7-67                                    54.9   60.4
processFile.m.\    Main                         src/Main.hs:15:25-134                                  21.6   26.7
processFile        Main                         src/Main.hs:(11,1)-(18,39)                             11.6    9.6
readDecimal_.start Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(260,5)-(265,32)    5.5    1.6
readDecimal_       Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(246,1)-(335,59)    1.2    0.0
processFile.m.\.\  Main                         src/Main.hs:15:97-105                                   1.1    0.8
processFile.m      Main                         src/Main.hs:(14,7)-(17,17)                              1.1    0.0


                                                                                                                                      individual      inherited                                                                                                              
COST CENTRE                      MODULE                       SRC                                                  no.     entries  %time %alloc   %time %alloc                                                                                                              

MAIN                             MAIN                         <built-in>                                           154          0    0.
0    0.0   100.0  100.0CAF                             GHC.IO.Handle.FD             <entire-module>                                      195          0    0.
0    0.0     0.0    0.0                                                                                                               
 CAF                             GHC.IO.Handle.Text           <entire-module>                                      193          0    0.0    0.0     0.0    0.0                                                                                                               
 CAF                             GHC.IO.Encoding              <entire-module>                                      187          0    0.0    0.0     0.0    0.0                                                                                                               
 CAF                             GHC.IO.FD                    <entire-module>                                      186          0    0.0    0.0     0.0    0.0                                                                                                               
 CAF                             GHC.Conc.Signal              <entire-module>                                      182          0    0.0    0.0     0.0    0.0                                                                                                               
 CAF                             GHC.IO.Encoding.Iconv        <entire-module>                                      174          0    0.0    0.0     0.0    0.0                                                                                                               
 CAF:main1                       Main                         <no location info>                                   306          0    0.0    0.0     0.0    0.0                                                                                                               
  main                           Main                         src/Main.hs:(21,1)-(23,9)                            308          1    0.0    0.0     0.0    0.0                                                                                                               
 CAF:main2                       Main                         <no location info>                                   304          0    0.0    0.0     0.0    0.0                                                                                                               
  processFile                    Main                         src/Main.hs:(11,1)-(18,39)                           310          1    0.0    0.0     0.0    0.0                                                                                                               
 CAF:main7                       Main                         <no location info>                                   302          0    0.0    0.0     0.0    0.0                                                                                                               
  processFile                    Main                         src/Main.hs:(11,1)-(18,39)                           312          0    0.0    0.0     0.0    0.0                                                                                                               
 CAF:readDecimal__$sreadDecimal_ Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:246:1-12         277          0    0.0    0.0     0.0    0.0                                                                                                               
  readDecimal_                   Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(246,1)-(335,59) 317          1    0.0    0.0     0.0    0.0                                                                                                               
 CAF:toInt                       Main                         src/Main.hs:9:1-5                                    303          0    0.0    0.0     0.0    0.0                                                                                                               
  toInt                          Main                         src/Main.hs:9:1-20                                   316          1    0.0    0.0     0.0    0.0                                                                                                               
 main                            Main                         src/Main.hs:(21,1)-(23,9)                            309          0    0.0    0.0   100.0  100.0                                                                                                               
  processFile                    Main                         src/Main.hs:(11,1)-(18,39)                           311          0   11.6    9.6   100.0  100.0                                                                                                               
   processFile.info              Main                         src/Main.hs:13:7-67                                  313          1   54.9   60.4    54.9   60.4                                                                                                               
   processFile.m                 Main                         src/Main.hs:(14,7)-(17,17)                           314          1    1.1    0.0    33.5   30.1                                                                                                               
    processFile.m.\              Main                         src/Main.hs:15:25-134                                315   10512769   21.6   26.7    32.4   30.1                                                                                                               
     processFile.m.\.(...)       Main                         src/Main.hs:15:29-66                                 321   10512769    0.0    0.0     0.0    0.0                                                                                                               
     processFile.m.\.key         Main                         src/Main.hs:15:29-66                                 320   10512769    0.3    0.0     0.3    0.0                                                                                                               
     processFile.m.\.val         Main                         src/Main.hs:15:29-66                                 326   10512769    0.3    0.0     0.3    0.0                                                                                                               
     processFile.m.\.\           Main                         src/Main.hs:15:97-105                                327   10512355    1.1    0.8     1.1    0.8                                                                                                               
     readDecimal_                Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(246,1)-(335,59) 318          0    1.2    0.0     9.0    2.5                                                                                                               
      readDecimal_.start         Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(260,5)-(265,32) 319   21025538    5.5    1.6     7.8    2.5                                                                                                               
       readDecimal_.loop0        Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(268,5)-(274,32) 322   21025538    0.9    0.0     2.3    1.0                                                                                                               
        readDecimal_.loop1       Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(278,5)-(284,52) 323   12785161    0.6    0.1     1.4    1.0                                                                                                               
         readDecimal_.loop2      Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(285,5)-(291,53) 324   10839846    0.6    0.0     0.8    0.8                                                                                                               
          readDecimal_.loop3     Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(292,5)-(298,54) 325   10556577    0.2    0.8     0.2    0.8                                                                                                               
           readDecimal_.loop4    Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(299,5)-(305,55) 328       6556    0.0    0.0     0.0    0.0                                                                                                               
            readDecimal_.loop5   Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(306,5)-(312,56) 329        724    0.0    0.0     0.0    0.0                                                                                                               
             readDecimal_.loop6  Data.ByteString.Lex.Integral src/Data/ByteString/Lex/Integral.hs:(313,5)-(319,57) 330         71    0.0    0.0     0.0    0.0                                               

It looks like over half of our time is spent in info, which splits all of our lines we’ve read by ‘’ into a new list. I wonder if using BangPattern’s to make info strict would help? Let’s see….

It actually made it a little slower. If you notice, we actually discard parts of the list later in the fold:

let m = foldl'
             (\acc x -> let (_:key:val:_) = (x :: [C8.ByteString]) in M.insertWith (\new old -> old + new) (toInt key) (toInt val) acc)
             M.empty
             info

If we make info strict then we load those pieces that we don’t use into memory anyway. Thanks laziness! Hm, that makes me wonder… what would happen if we use lazy ByteStrings? Let’s try it… why not? I’ll keep the IntMap strict since we know every piece of it constructed is needed for the final result.

Let’s look over the code again:

toInt :: C8.ByteString -> Int
toInt = readDecimal_

processFile = do
  bsLines <- C8.lines <$> C8.readFile "../ngrams.tsv"
  let info = C8.splitWith (=='\t') <$> bsLines :: [[C8.ByteString]]
  let m = foldl'
             (\acc x -> let (_:key:val:_) = (x :: [C8.ByteString]) in M.insertWith (\new old -> old + new) (toInt key) (toInt val) acc)
             M.empty
             info
  pure $ last $ sortOn snd (M.toList m)

AHA! C8.splitWith (==‘’) is doing the same functionality that C8.words is commonly used for. I’m also pretty sure since it’s widely used that C8.words will be faster than C8.splitWith (==‘’).

Here’s the code after the change:

processFile = do
  info <- fmap C8.words . C8.lines <$> C8.readFile "../ngrams.tsv"
  let m = foldl'
             (\acc x -> let (_:key:val:_) = (x :: [C8.ByteString]) in M.insertWith (\new old -> old + new) (toInt key) (toInt val) acc)
             M.empty
             info
  pure $ last $ sortOn snd (M.toList m)

Now let’s run it!

cody@zentop:~/faster-command-line-tools-with-haskell/haskell-version$ stack clean && stack build && stack exec -- time haskell-version
haskell-version-0.1.0.0: configure (exe)
Configuring haskell-version-0.1.0.0...
haskell-version-0.1.0.0: build (exe)
Preprocessing executable 'haskell-version' for haskell-version-0.1.0.0...
[1 of 1] Compiling Main             ( src/Main.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version-tmp/Main.o )
Linking .stack-work/dist/x86_64-linux/Cabal-1.24.2.0/build/haskell-version/haskell-version ...
haskell-version-0.1.0.0: copy/register
Installing executable(s) in
/home/cody/faster-command-line-tools-with-haskell/haskell-version/.stack-work/install/x86_64-linux/lts-9.0/8.0.2/bin
(2006,22569011)
2.00user 0.07system 0:02.07elapsed 99%CPU (0avgtext+0avgdata 245192maxresident)k
0inputs+0outputs (0major+14078minor)pagefaults 0swaps

2 seconds! Awesome. We beat Go :P

Now to the elephant in the room: Why was C8.words faster?

To the implementation!

-- | 'words' breaks a ByteString up into a list of words, which
-- were delimited by Chars representing white space.
words :: ByteString -> [ByteString]
words = P.filter (not . B.null) . B.splitWith isSpaceWord8
{-# INLINE words #-}

B.splitWith is actually used within words, but it uses isSpaceWord8 from Data.ByteString.Internal:

-- | Selects words corresponding to white-space characters in the Latin-1 range
-- ordered by frequency.
isSpaceWord8 :: Word8 -> Bool
isSpaceWord8 w =
    w == 0x20 ||
    w == 0x0A || -- LF, \n
    w == 0x09 || -- HT, \t
    w == 0x0C || -- FF, \f
    w == 0x0D || -- CR, \r
    w == 0x0B || -- VT, \v
    w == 0xA0    -- spotted by QC..
{-# INLINE isSpaceWord8 #-}

It does a comparison to the Word8 value directly rather than \t, though thinking in terms of platonic ideals I’m wondering if this isn’t something a sufficiently smart compiler should aim to optimize.

There we have it: Faster command line tools with Haskell and a short practical guide to benchmarking and improving performance!