[Haskell-beginners] The cost of generality, or how expensive is realToFrac?

Post by Greg
First, to anyone who recognizes me by name, thanks to the help I've been
getting here I've managed to put together a fairly complex set of code
files that all compile together nicely, run, and do exactly what I
wanted them to do. Success!
The trouble is that my implementation is dog slow
Fortunately, this isn't the first time I've been in over my head and I
started by putting up some simpler scaffolding- which runs much more
quickly. Working backwards, it looks like the real bottle neck is in
the data types I've created, the type variables I've introduced, and the
conversion code I needed to insert to make it all happy.
I'm not sure it helps, but I've attached a trimmed down version of the
relevant code. What should be happening is my pair is being converted
to the canonical form for Coord2D which is Cartesian2D and then
converted again to Vertex2. There shouldn't be any change made to the
values, they're only being handed from one container to another in this
case (Polar coordinates would require computation, but I've stripped
that out for the time being). However, those handoffs require calls to
realToFrac to make the type system happy, and that has to be what is
eating up all my CPU.

Not all, but probably a big chunk of it.
The problem is that the default implementation of realToFrac is

realToFrac = fromRational . toRational

a) with that implementation, realToFrac :: Double -> Double is not the
identity (doesn't respect NaNs)
b) it's slow, there are no special operations to convert Double, Float etc.
from/to Rational.

For a lot of types, GHC provides rewrite rules (you need to compile with
optimisations to have them fire) which give faster versions (with somewhat
different behaviour, e.g. realToFrac :: Double -> Double is rewritten to
id, realToFrac between Float and Double uses primitive widening/narrowing
ops, for several newtype wrappers around Float/Double there are rules too).

Post by Greg
I think there are probably 4 calls to realToFrac. If I walk through the
code, the result, given the pair p, should be: Vertex2 (realToFrac
(realToFrac (fst p))) (realToFrac (realToFrac (snd p)))
I'd like to maintain type independence if possible, but I expect most
uses of this code to feed Doubles in for processing and probably feed
GLclampf (Floats, I believe)

newtype wrapper around CFloat, which is a newtype wrapper around Float

Unfortunately, there are no rewrite rules in the module where it is
defined, apparently neither any other module that has access to the
constructor. And the constructor is not accessible from any of the exposed
modules, so as far as I know, you can't provide your own rewrite rules.

Post by Greg
to the OpenGL layer. If there's a way to
do so, I wouldn't mind optimizing for that particular set of types.
I've tried GLdouble, and it doesn't really improve things with the
current code.
Is there a way to short circuit those realToFrac calls if we know the
input and output are the same type? Is there a way merge the nested
calls?

You can try rewrite rules

{-# RULES
"realToFrac2/realToFrac" realToFrac . realToFrac = realToFrac
"realToFrac/id" realToFrac = id
#-}

but I'm afraid the second won't work at all, then you'd have to specify all
interesting cases yourself (there are rules for the cases Double -> Double
and Float -> Float in GHC.Float, rules for converting from/to CFloat and
CDouble in Foreign.C.Types, so those should be fine too)
"realToFrac/GLclampf->GLclampf" realToFrac = id :: GLclampf -> GLclampf
and what ese you need.
Whether the first one will help (or even work), I don't know either, you
have to try.

Post by Greg
Any other thoughts on what I can do here? The slow down between the two
implementations is at least 20x, which seems like a steep penalty to
pay.

In case of emergency, put the needed rewrite rules into the source of
OpenGLRaw yourself.

Post by Greg
And while I'm at it, is turning on FlexibleInstances the only way to
create an instance for (a,a)?

Yes. Haskell98 doesn't allow such instance declarations, so you need the
extension.

Greg

2010-09-15 18:50:13 UTC

Hey, thanks, Daniel.

I hadn't come across rewrite rules yet. They definitely look like something worth learning, though I'm not sure I'm prepared to start making custom versions of OpenGL.Raw...

It looks like I managed to put that battle off for another day, however. I did look at how realToFrac is implemented and (as you mention) it does the fromRational . toRational transform pair suggested in a number of sources, including Real World Haskell. Looking at what toRational is doing, creating a ratio of integers out of a float it seems like a crazy amount of effort to go through just to convert floating point numbers.

Looking at the RealFloat class rather that Real and Fractional, it seems like this is a much more efficient way to go:

floatToFloat :: (RealFloat a, RealFloat b) => a -> b
floatToFloat = (uncurry encodeFloat) . decodeFloat

I substituted this in for realToFrac and I'm back to close to my original performance. Playing with a few test cases in ghci, it looks numerically equivalent to realToFrac.

This begs the question though-- am I doing something dangerous here? Why isn't this the standard approach?

If I understand what's happening, decodeFloat and encodeFloat are breaking the floating point numbers up into their constituent parts-- presumably by bit masking the raw binary. That would explain the performance improvement. I suppose there is some implementation dependence here, but as long as the encode and decode are implemented as a matched set then I think I'm good.

Cheers--
Greg

Not all, but probably a big chunk of it.
The problem is that the default implementation of realToFrac is
realToFrac = fromRational . toRational
a) with that implementation, realToFrac :: Double -> Double is not the
identity (doesn't respect NaNs)
b) it's slow, there are no special operations to convert Double, Float etc.
from/to Rational.
For a lot of types, GHC provides rewrite rules (you need to compile with
optimisations to have them fire) which give faster versions (with somewhat
different behaviour, e.g. realToFrac :: Double -> Double is rewritten to
id, realToFrac between Float and Double uses primitive widening/narrowing
ops, for several newtype wrappers around Float/Double there are rules too).

newtype wrapper around CFloat, which is a newtype wrapper around Float
Unfortunately, there are no rewrite rules in the module where it is
defined, apparently neither any other module that has access to the
constructor. And the constructor is not accessible from any of the exposed
modules, so as far as I know, you can't provide your own rewrite rules.

You can try rewrite rules
{-# RULES
"realToFrac2/realToFrac" realToFrac . realToFrac = realToFrac
"realToFrac/id" realToFrac = id
#-}
but I'm afraid the second won't work at all, then you'd have to specify all
interesting cases yourself (there are rules for the cases Double -> Double
and Float -> Float in GHC.Float, rules for converting from/to CFloat and
CDouble in Foreign.C.Types, so those should be fine too)
"realToFrac/GLclampf->GLclampf" realToFrac = id :: GLclampf -> GLclampf
and what ese you need.
Whether the first one will help (or even work), I don't know either, you
have to try.

Post by Greg
Any other thoughts on what I can do here? The slow down between the two
implementations is at least 20x, which seems like a steep penalty to
pay.

In case of emergency, put the needed rewrite rules into the source of
OpenGLRaw yourself.

Post by Greg
And while I'm at it, is turning on FlexibleInstances the only way to
create an instance for (a,a)?

Yes. Haskell98 doesn't allow such instance declarations, so you need the
extension.

Daniel Fischer

2010-09-15 22:36:11 UTC

I hadn't come across rewrite rules yet. They definitely look like
something worth learning,

Absolutely. GHC's optimiser is good, but there are a lot of cases where you
need to push it via rewrite rules if you write polymorphic code or if you
want to eliminate intermediate data structures (e.g. list fusion).

though I'm not sure I'm prepared to start making custom versions
of OpenGL.Raw...

Yes, if you can work around the issues without that, it's better to leave
it in peace :)
Though you might ask the maintainer for rewrite rules.

It looks like I managed to put that battle off for another day, however.
I did look at how realToFrac is implemented and (as you mention) it
does the fromRational . toRational transform pair suggested in a number
of sources, including Real World Haskell. Looking at what toRational is
doing, creating a ratio of integers out of a float it seems like a crazy
amount of effort to go through just to convert floating point numbers.

I just did some benchmarking.

I benchmarked

foldl' (+) 0 [convert (1 / intToDoub k) | k <- [1 .. 100000]]

where

intToDoub :: Int -> Double
intToDoub = fromIntegral

for several functions

convert :: Double -> Float (actually, the type has been
(RealFloat a, RealFloat b) => a -> b, but it was used at a = Double,
b = Float).
Everything was compiled with -O2, so the rewrite rules fired, in particular
intToDoub was replaced by a primop (int2Double#), so that one's ultra
cheap.

For convert = realToFrac (hence by the rewrite rules the primop
double2Float# was used), I got pretty good times, mean was 6.76 ms.

For convert = floatToFloat from below, the times were not too bad, with a
mean of 26.3 ms.
A factor of roughly four for this benchmark (the factor for the conversion
itself will be higher, but not exorbitantly) means it's usable in many
situations, but not in performance critical situations where the conversion
takes a significant amount of the running time. If you're converting to
draw stuff with OpenGL (or some other graphics library), the conversion
will take only a relatively small part of the time, so it's fine.

For convert = fromRational . toRational (so no rewrite rules), the times
were rather appalling: mean was 3.34 seconds.
A factor of nearly 500 versus double2Float#.

toRational is bad. Looking at

instance Real Double where
toRational x = (m%1)*(b%1)^^n
where (m,n) = decodeFloat x
b = floatRadix x

(same for Float), and the implementations of (^^) and (^),

{-# SPECIALISE (^) ::
Integer -> Integer -> Integer,
Integer -> Int -> Integer,
Int -> Int -> Int #-}
(^) :: (Num a, Integral b) => a -> b -> a
x0 ^ y0 | y0 < 0 = error "Negative exponent"
| y0 == 0 = 1
| otherwise = f x0 y0
where -- f : x0 ^ y0 = x ^ y
f x y | even y = f (x * x) (y `quot` 2)
| y == 1 = x
| otherwise = g (x * x) ((y - 1) `quot` 2) x
-- g : x0 ^ y0 = (x ^ y) * z
g x y z | even y = g (x * x) (y `quot` 2) z
| y == 1 = x * z
| otherwise = g (x * x) ((y - 1) `quot` 2) (x * z)

-- | raise a number to an integral power
{-# SPECIALISE (^^) ::
Rational -> Int -> Rational #-}
(^^) :: (Fractional a, Integral b) => a -> b -> a
x ^^ n = if n >= 0 then x^n else recip (x^(negate n))

together with the multiplication and recip for Rationals, I have to say
ouch!
There's no special implementation and rewrite rule for powers of Rationals,
so on each multiplication in (^), the gcd of numerator and denominator is
calculated, *although as powers of the original numerator and denominator
they are guaranteed to be coprime*. Considering how slow a division of
Integers is, awwwwwww noooooo.

So let's look at a better implementation of toRational:

toRat :: RealFloat a => a -> Rational
toRat x = case decodeFloat x of
(m,e) -> case floatRadix x of
b -> if e < 0
then (m % (b^(negate e)))
else (m * b^e) :% 1

(inlined a better implementation of powers for Rationals).

Benchmarking convert = fromRational . toRat show a significant improvement,
the mean dropped to 2.75 seconds.
Still appalling, but it's a nice improvement and I don't see any quick
opportunities to improve that conversion.

So let's come to the last, fromRational. That's a compicated function, and
unfortunately it has to be and I've no good idea to improve it.
fromRational is really evil (in terms of clock cycles).
Replacing fromRational with a dummy that just forces the evaluation of its
argument and returns NaN, ±Infinity, or 0 for all real Rational values,

dummy . toRational had a mean of 623.5 ms and
dummy . toRat had a mean of 200.7 ms.

So toRat is a jolly good improvement over toRational, but it's still
awfully slow. And since fromRational takes much much longer anyway, it's a
not too impressive gain for realToFrac.

Looking at the RealFloat class rather that Real and Fractional, it seems
floatToFloat :: (RealFloat a, RealFloat b) => a -> b
floatToFloat = (uncurry encodeFloat) . decodeFloat

Yes, that's much more efficient, as witnessed by the benchmark results.
But.

I substituted this in for realToFrac and I'm back to close to my
original performance. Playing with a few test cases in ghci, it looks
numerically equivalent to realToFrac.
This begs the question though--

No. Sorry, but I can't bear that misuse:
http://en.wikipedia.org/wiki/Begging_the_question

It raises/demands/invites the question, but it doesn't beg it.

am I doing something dangerous here?

Yes and no.

Why isn't this the standard approach?

Because it will wreak unspeakable havoc when someone creates a RealFloat
instance with a floatRadix > 2. A floatRadix of 10 or some power of 2 (16,
256?) could be even reasonable.

But for conversions between RealFloat types with the same floatRadix, it's
sort of okay, only it clobbers NaNs and (for some conversions) Infinities.
However, realToFrac does that too.

If I understand what's happening, decodeFloat and encodeFloat are
breaking the floating point numbers up into their constituent parts--
presumably by bit masking the raw binary.

Probably.

That would explain the
performance improvement. I suppose there is some implementation
dependence here, but as long as the encode and decode are implemented as
a matched set then I think I'm good.

Not entirely matched, for Float -> Float and Double -> Double,
NaN -> -Infinity, maybe denormalized values break too.

±Infinity is mapped to a finite value at Float -> Double

But since toRational uses decodeFloat and fromRational uses encodeFloat,
floatToFloat is in that respect no worse than realToFrac without rewrite
rules.

Cheers--
Greg

Greg

2010-09-16 19:47:43 UTC

Wow, thanks for all the analysis on this, Daniel!

So, I think the summary of your evaluation is that realToFrac does just fine in almost all cases due to careful optimization and outperforms my floatToFloat trick (which answers the question I flagrantly begged... ;~) except that I can't access those optimizations because the OpenGL types are hidden behind newtypes and buried in a library.

I've confirmed your results to make sure I could.

I went a step further and typed convert for GLclampf expecting to see unoptimized performance, but it ran just as fast as Double->Float.

convert :: Double -> GL.GLclampf
convert = realToFrac

And still ran faster than floatToFloat. However there's no denying that floatToFloat runs *much* faster than realToFrac in the larger application. Profiling shows floatToFloat taking about 50% of my CPU time, but the frame rate is close to (though not quite) 30 fps. Using realToFrac takes seconds just to display the first frame.

As it stands, floatToFloat is responsible for 50% of my CPU time which I think is mainly a tribute the OpenGL implementation, but still makes it the obvious target for optimization. Having gotten all excited about the new tool in my box, I wrote the following rule:

{-# RULES
"floatToFloat/id" floatToFloat=id
"floatToFloat x2" floatToFloat . floatToFloat = floatToFloat
#-}

Neither of which seems to fires in this application, but I did get the first one to fire by importing the same file into my benchmark.

The next obvious step is to optimize a level up in the call hierarchy, and rewrite coordToCoord2D since I know my original pair2vertex function was faster. So, I added this rule in the file where Vertex2 is made an instance of Coord2D:

{-# RULES
"coordToCoord2D/p2v2" coordToCoord2D = pair2vertex
#-}

which refers to two functions, each in different files (I don't think that matters, but mention it just in case)

pair2vertex :: (Num a) => (a,a) -> GL.Vertex2 a
pair2vertex (x,y) = GL.Vertex2 x y

coordToCoord2D :: (Coord2D a, Coord2D b) => a -> b
coordToCoord2D = fromCartesian2D . toCartesian2D

directly after my coordToCoord2D definition I have this rule as well:

{-# RULES
"coordToCoord2D/id" coordToCoord2D = id
"coordToCoord2D x2" coordToCoord2D . coordToCoord2D = coordToCoord2D
#-}

I get a compile time error that I can't make sense of. It's asking me to put a context on my rule, but I can't find any references on how to do that...

-----------

Could not deduce (Num a)
from the context (Coord2D (Vertex2 a), Coord2D (a, a))
arising from a use of `pair2vertex'
at GCB/OpenGL/Geometry.hs:32:40-50
Possible fix:
add (Num a) to the context of the RULE "coordToCoord2D/p2v2"
In the expression: pair2vertex
When checking the transformation rule "coordToCoord2D/p2v2"

-----------

The file:line:column is the "pair2vertex" token in the rule I list above.

Cheers--
Greg

Post by Greg
I hadn't come across rewrite rules yet. They definitely look like
something worth learning,

Post by Greg
though I'm not sure I'm prepared to start making custom versions
of OpenGL.Raw...

Yes, if you can work around the issues without that, it's better to leave
it in peace :)
Though you might ask the maintainer for rewrite rules.

Post by Greg
It looks like I managed to put that battle off for another day, however.
I did look at how realToFrac is implemented and (as you mention) it
does the fromRational . toRational transform pair suggested in a number
of sources, including Real World Haskell. Looking at what toRational is
doing, creating a ratio of integers out of a float it seems like a crazy
amount of effort to go through just to convert floating point numbers.

I just did some benchmarking.
I benchmarked
foldl' (+) 0 [convert (1 / intToDoub k) | k <- [1 .. 100000]]
where
intToDoub :: Int -> Double
intToDoub = fromIntegral
for several functions
convert :: Double -> Float (actually, the type has been
(RealFloat a, RealFloat b) => a -> b, but it was used at a = Double,
b = Float).
Everything was compiled with -O2, so the rewrite rules fired, in particular
intToDoub was replaced by a primop (int2Double#), so that one's ultra
cheap.
For convert = realToFrac (hence by the rewrite rules the primop
double2Float# was used), I got pretty good times, mean was 6.76 ms.
For convert = floatToFloat from below, the times were not too bad, with a
mean of 26.3 ms.
A factor of roughly four for this benchmark (the factor for the conversion
itself will be higher, but not exorbitantly) means it's usable in many
situations, but not in performance critical situations where the conversion
takes a significant amount of the running time. If you're converting to
draw stuff with OpenGL (or some other graphics library), the conversion
will take only a relatively small part of the time, so it's fine.
For convert = fromRational . toRational (so no rewrite rules), the times
were rather appalling: mean was 3.34 seconds.
A factor of nearly 500 versus double2Float#.
toRational is bad. Looking at
instance Real Double where
toRational x = (m%1)*(b%1)^^n
where (m,n) = decodeFloat x
b = floatRadix x
(same for Float), and the implementations of (^^) and (^),
Integer -> Integer -> Integer,
Integer -> Int -> Integer,
Int -> Int -> Int #-}
(^) :: (Num a, Integral b) => a -> b -> a
x0 ^ y0 | y0 < 0 = error "Negative exponent"
| y0 == 0 = 1
| otherwise = f x0 y0
where -- f : x0 ^ y0 = x ^ y
f x y | even y = f (x * x) (y `quot` 2)
| y == 1 = x
| otherwise = g (x * x) ((y - 1) `quot` 2) x
-- g : x0 ^ y0 = (x ^ y) * z
g x y z | even y = g (x * x) (y `quot` 2) z
| y == 1 = x * z
| otherwise = g (x * x) ((y - 1) `quot` 2) (x * z)
-- | raise a number to an integral power
Rational -> Int -> Rational #-}
(^^) :: (Fractional a, Integral b) => a -> b -> a
x ^^ n = if n >= 0 then x^n else recip (x^(negate n))
together with the multiplication and recip for Rationals, I have to say
ouch!
There's no special implementation and rewrite rule for powers of Rationals,
so on each multiplication in (^), the gcd of numerator and denominator is
calculated, *although as powers of the original numerator and denominator
they are guaranteed to be coprime*. Considering how slow a division of
Integers is, awwwwwww noooooo.
toRat :: RealFloat a => a -> Rational
toRat x = case decodeFloat x of
(m,e) -> case floatRadix x of
b -> if e < 0
then (m % (b^(negate e)))
else (m * b^e) :% 1
(inlined a better implementation of powers for Rationals).
Benchmarking convert = fromRational . toRat show a significant improvement,
the mean dropped to 2.75 seconds.
Still appalling, but it's a nice improvement and I don't see any quick
opportunities to improve that conversion.
So let's come to the last, fromRational. That's a compicated function, and
unfortunately it has to be and I've no good idea to improve it.
fromRational is really evil (in terms of clock cycles).
Replacing fromRational with a dummy that just forces the evaluation of its
argument and returns NaN, ±Infinity, or 0 for all real Rational values,
dummy . toRational had a mean of 623.5 ms and
dummy . toRat had a mean of 200.7 ms.
So toRat is a jolly good improvement over toRational, but it's still
awfully slow. And since fromRational takes much much longer anyway, it's a
not too impressive gain for realToFrac.

Post by Greg
Looking at the RealFloat class rather that Real and Fractional, it seems
floatToFloat :: (RealFloat a, RealFloat b) => a -> b
floatToFloat = (uncurry encodeFloat) . decodeFloat

Yes, that's much more efficient, as witnessed by the benchmark results.
But.

Post by Greg
I substituted this in for realToFrac and I'm back to close to my
original performance. Playing with a few test cases in ghci, it looks
numerically equivalent to realToFrac.
This begs the question though--

http://en.wikipedia.org/wiki/Begging_the_question
It raises/demands/invites the question, but it doesn't beg it.

Post by Greg
am I doing something dangerous here?

Yes and no.

Post by Greg
Why isn't this the standard approach?

Because it will wreak unspeakable havoc when someone creates a RealFloat
instance with a floatRadix > 2. A floatRadix of 10 or some power of 2 (16,
256?) could be even reasonable.
But for conversions between RealFloat types with the same floatRadix, it's
sort of okay, only it clobbers NaNs and (for some conversions) Infinities.
However, realToFrac does that too.

Post by Greg
If I understand what's happening, decodeFloat and encodeFloat are
breaking the floating point numbers up into their constituent parts--
presumably by bit masking the raw binary.

Probably.

Post by Greg
That would explain the
performance improvement. I suppose there is some implementation
dependence here, but as long as the encode and decode are implemented as
a matched set then I think I'm good.

Not entirely matched, for Float -> Float and Double -> Double,
NaN -> -Infinity, maybe denormalized values break too.
±Infinity is mapped to a finite value at Float -> Double
But since toRational uses decodeFloat and fromRational uses encodeFloat,
floatToFloat is in that respect no worse than realToFrac without rewrite
rules.

Post by Greg
Cheers--
Greg

Daniel Fischer

2010-09-16 21:33:53 UTC

Post by Greg
Wow, thanks for all the analysis on this, Daniel!
So, I think the summary of your evaluation is that realToFrac does just
fine in almost all cases

I wouldn't say that. realToFrac does fine in those cases where a rewrite
rule provides a fast conversion (or one of the types bewteen which you want
to convert is Rational, when you get fromRational or toRational - the
latter can be improved in several cases).

Post by Greg
due to careful optimization and outperforms my
floatToFloat trick (which answers the question I flagrantly begged...
;~) except that I can't access those optimizations because the OpenGL
types are hidden behind newtypes and buried in a library.
I've confirmed your results to make sure I could.
I went a step further and typed convert for GLclampf expecting to see
unoptimized performance, but it ran just as fast as Double->Float.
convert :: Double -> GL.GLclampf
convert = realToFrac

It would be interesting to see what core GHC produces for that (you can get
the core with the `-ddump-simpl' command line flag [redirect stdout to a
file] or with the ghc-core tool [available on hackage]).
If it runs as fast as realToFrac :: Double -> Float (with optimisations),
GHC must have rewritten realToFrac to double2Float# and it should only do
that if there are rewrite rules for GLclampf.
In that case, the problem is probably that GHC doesn't see the realToFrac
applications because they're too deeply wrapped in your coordToCoord2D
calls.

If that is the problem, it might help to use {-# INLINE #-} pragmas on
coordToCoord2D, fromCartesian2D and toCartesian2D.
Can you try with realToFrac and the {-# INLINE #-} pragmas?

Post by Greg
And still ran faster than floatToFloat. However there's no denying that
floatToFloat runs *much* faster than realToFrac in the larger
application. Profiling shows floatToFloat taking about 50% of my CPU

That's too much for my liking, a simple conversion from Double to Float
shouldn't take long, even if the Float is wrapped in newtypes (after all,
the newtypes don't exist at runtime).

Post by Greg
time, but the frame rate is close to (though not quite) 30 fps. Using
realToFrac takes seconds just to display the first frame.
As it stands, floatToFloat is responsible for 50% of my CPU time which I
think is mainly a tribute the OpenGL implementation, but still makes it
the obvious target for optimization. Having gotten all excited about
{-# RULES
"floatToFloat/id" floatToFloat=id
"floatToFloat x2" floatToFloat . floatToFloat = floatToFloat
#-}
Neither of which seems to fires in this application,

GHC reports fired rules with -ddump-simpl-stats.
Getting rules to fire is a little brittle, GHC does not try too hard to
match expressions with rules, and if several rules match, it chooses one
arbitrarily, so your rules may have been missed because the actual code
looked different (perhaps because other rewrite rules fired first).

Post by Greg
but I did get the
first one to fire by importing the same file into my benchmark.
The next obvious step is to optimize a level up in the call hierarchy,
and rewrite coordToCoord2D since I know my original pair2vertex function
was faster. So, I added this rule in the file where Vertex2 is made an
{-# RULES
"coordToCoord2D/p2v2" coordToCoord2D = pair2vertex
#-}
which refers to two functions, each in different files (I don't think
that matters, but mention it just in case)
pair2vertex :: (Num a) => (a,a) -> GL.Vertex2 a
pair2vertex (x,y) = GL.Vertex2 x y
coordToCoord2D :: (Coord2D a, Coord2D b) => a -> b
coordToCoord2D = fromCartesian2D . toCartesian2D
{-# RULES
"coordToCoord2D/id" coordToCoord2D = id
"coordToCoord2D x2" coordToCoord2D . coordToCoord2D = coordToCoord2D
#-}
I get a compile time error that I can't make sense of. It's asking me
to put a context on my rule, but I can't find any references on how to
do that...
-----------
Could not deduce (Num a)
from the context (Coord2D (Vertex2 a), Coord2D (a, a))
arising from a use of `pair2vertex'
at GCB/OpenGL/Geometry.hs:32:40-50
add (Num a) to the context of the RULE "coordToCoord2D/p2v2"
In the expression: pair2vertex
When checking the transformation rule "coordToCoord2D/p2v2"
-----------
The file:line:column is the "pair2vertex" token in the rule I list above.

Yes, there's a Num constraint on pair2vertex, but not on coordToCoord2D, so
it's not type correct.
You could try removing the Num constraint from pair2vertex or add the
constraint to the rule,

{-# RULES
"coordToCoord2D/p2v2" forall a. Num a => (coordToCoord2D :: (a,a) ->
GL.Vertex2 a) = pair2vertex
#-}

(well, I don't know whether that's the correct way, but you can try).

Post by Greg
Cheers--
Greg

Greg

2010-09-19 00:41:36 UTC

Post by Daniel Fischer
It would be interesting to see what core GHC produces for that (you can get
the core with the `-ddump-simpl' command line flag [redirect stdout to a
file] or with the ghc-core tool [available on hackage]).
If it runs as fast as realToFrac :: Double -> Float (with optimisations),
GHC must have rewritten realToFrac to double2Float# and it should only do
that if there are rewrite rules for GLclampf.

I'm not sure if you literally meant you wanted to see the output or not, but I've attached a zip of the dump files and my simple source file. The dump file naming is cryptic, but the first letters refer to the definition of 'convert' where:

fTF: use the floatToFloat function in the source file
rTF: use the standard realToFrac
fRtR: use (fromRational . toRational)

The next three characters indicate the type signature of convert:

d2f: Double -> Float
d2g: Double -> GL.GLclampf

I'd summarize the results, but apparently I took the blue pill and can't make heads or tails of what I'm seeing in the dump format...

Post by Daniel Fischer
In that case, the problem is probably that GHC doesn't see the realToFrac
applications because they're too deeply wrapped in your coordToCoord2D
calls.
If that is the problem, it might help to use {-# INLINE #-} pragmas on
coordToCoord2D, fromCartesian2D and toCartesian2D.
Can you try with realToFrac and the {-# INLINE #-} pragmas?

I tried inlining the functions you suggest with little effect. The realToFrac version (in this case I just set floatToFloat=realToFrac to save the search and replace effort) is just too heavily loaded to see any difference at all (98+% of CPU is spent in realToFrac). The same inlining using my definition of floatToFloat gave me a 10% improvement from 50% -> 46% of the CPU spent in floatToFloat and an inverse change in allocation to match.

Best I can tell, the inlining is being recognized, but just not changing much.

That's too much for my liking, a simple conversion from Double to Float
shouldn't take long, even if the Float is wrapped in newtypes (after all,
the newtypes don't exist at runtime).

Agreed. The rest of the application right now isn't doing a lot of work yet though-- I'm generating (pre-calculating, if Haskell is doing it's job) a list of 360*180 points on a sphere and dumping that to OpenGL which should be doing most of the dirty work in hardware. I'm not entirely sure why floatToFloat recalculates every iteration and isn't just cached, but I'm guessing it's because the floatToFloat is being done in an OpenGL callback within the IO monad. Eventually I'll be providing time-varying data anyway, so the conversions will have to be continuously recalculated in the end.

That comes out to 65000 conversions every 30ms, or about 2 million conversions a second. I'd probably just leave it at that except, as you've demonstrated, there is at least a factor of 3 or 4 to be gained somehow-- realToFrac can provide it under the right conditions.

Post by Greg
{-# RULES
"floatToFloat/id" floatToFloat=id
"floatToFloat x2" floatToFloat . floatToFloat = floatToFloat
#-}
Neither of which seems to fires in this application,

Yeah, I've been looking at the -ddump-simp-stats output. If I'm reading the documentation right, rules are enabled simply by invoking ghc with -O or -O2, right? I'm now not convinced any of my rewrite rules are firing-- or at least I can't seem to get them to again.

Post by Greg
I get a compile time error that I can't make sense of. It's asking me
to put a context on my rule, but I can't find any references on how to
do that...
-----------
Could not deduce (Num a)
from the context (Coord2D (Vertex2 a), Coord2D (a, a))
arising from a use of `pair2vertex'
at GCB/OpenGL/Geometry.hs:32:40-50
add (Num a) to the context of the RULE "coordToCoord2D/p2v2"
In the expression: pair2vertex
When checking the transformation rule "coordToCoord2D/p2v2"

Yes, there's a Num constraint on pair2vertex, but not on coordToCoord2D, so
it's not type correct.
You could try removing the Num constraint from pair2vertex or add the
constraint to the rule,
{-# RULES
"coordToCoord2D/p2v2" forall a. Num a => (coordToCoord2D :: (a,a) ->
GL.Vertex2 a) = pair2vertex
#-}
(well, I don't know whether that's the correct way, but you can try).

No, that doesn't do it. I tried a few variations on that and it always chokes on the => symbol or whatever other syntax I try to use. The Num constraint was added because it was needed on related functions (3 element vertices where the z was stuffed with 0, for example), so I got rid of those and the Num constraint. Doesn't matter, the rule still doesn't fire... =(

Cheers--
Greg

Daniel Fischer

2010-09-19 14:56:21 UTC

Post by Daniel Fischer
It would be interesting to see what core GHC produces for that (you
can get the core with the `-ddump-simpl' command line flag [redirect
stdout to a file] or with the ghc-core tool [available on hackage]).
If it runs as fast as realToFrac :: Double -> Float (with
optimisations), GHC must have rewritten realToFrac to double2Float#
and it should only do that if there are rewrite rules for GLclampf.

I'm not sure if you literally meant you wanted to see the output or not,

Yes, but only if you were willing to take the trouble of producing it.
I actually was more interested in the core for the real app, but the core
for the toy benchmark is already interesting (see below).

Post by Greg
but I've attached a zip of the dump files and my simple source file.
The dump file naming is cryptic, but the first letters refer to the
fTF: use the floatToFloat function in the source file
rTF: use the standard realToFrac
fRtR: use (fromRational . toRational)
d2f: Double -> Float
d2g: Double -> GL.GLclampf
I'd summarize the results, but apparently I took the blue pill and can't
make heads or tails of what I'm seeing in the dump format...

Okay, for the results for the Double -> Float conversion,
fromRational . toRational took ~3.35 seconds
floatToFloat took 32 ms
realToFrac took 8 ms

(always compiled with -O2; the times are slightly higher than the criterion
benchmarking results from Wednesday/Thursday, that's probably because those
ran pre-warmed while today's run-once started up cold [and included a call
to getCPUTime]).

Now to GLclampf. I remembered that I had installed OpenGL with one of my
old GHCs (turned out to be 6.10.3), so I could also run the tests for
Double -> GLclampf.
Unsurprisingly, fromRational . toRational and floatToFloat had the same
performance as for Double -> Float. Equally unsurprisingly, were it not for
your results and the core you sent, realToFrac had the same performance as
fromRational . toRational.

In the core you sent for realToFrac :: Double -> GLclampf, we find the loop
for summing a list of GLclampf:

Rec {
$wlgo_r1wv :: GHC.Prim.Float#
-> [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf]
-> GHC.Prim.Float#
GblId
[Arity 2
NoCafRefs
Str: DmdType LS]
$wlgo_r1wv =
\ (ww_s1vV :: GHC.Prim.Float#)
(w_s1vX :: [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf]) ->
case w_s1vX of _ {
[] -> ww_s1vV;
: x_aVE xs_aVF ->
case x_aVE of _ { GHC.Types.F# y_a13F ->
$wlgo_r1wv (GHC.Prim.plusFloat# ww_s1vV y_a13F) xs_aVF
}
}
end Rec }

Wow, did you remove the casting annotations or does it really match a
GLclampf against the Float constructor F# without any ado?
If the latter, which compiler version have you?
Just for the record, 6.10.3 produced the same code, but with several levels
of casting from Float to GLclampf.

More interesting is the generation of the list:

Rec {
go_r1wx :: GHC.Prim.Int#
-> [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf]
GblId
[Arity 1
NoCafRefs
Str: DmdType L]
go_r1wx =
\ (x_a13o :: GHC.Prim.Int#) ->
GHC.Types.:
@ Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf
(case GHC.Prim./## 1.0 (GHC.Prim.int2Double# x_a13o)
of wild2_a14i { __DEFAULT ->
GHC.Types.F# (GHC.Prim.double2Float# wild2_a14i)
})
(case x_a13o of wild_B1 {
__DEFAULT -> go_r1wx (GHC.Prim.+# wild_B1 1);
100000 ->
GHC.Types.[] @ Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf
})
end Rec }

Wowwowwow, it conses a Float to a list of GLclampf without even mentioning
a cast. Since it feels free to do that, no wonder that it uses
double2Float#.
Hrm, okay, perhaps a new version of OpenGL[Raw]? Nope, 2.4.0.1 and 1.1.0.1,
what I have with 6.10.3.
So, perhaps it's 6.12 vs. 6.10? Install OpenGL for 6.12.3, try, nope, same
as 6.10.3, the summing is identical except for the casting annotations, but
the generation goes through fromRational and toRational [expected, because
there are no rewrite rules in OpenGLRaw].

What compiler are you using? HEAD? The core doesn't look like HEAD's core
to me, but that might be because nothing except main is exported.

Okay, so I threw a couple of rewrite rules into OpenGLRaw, reinstalled and
reran, now realToFrac gets properly rewritten to double2Float# (with
casts).

Post by Daniel Fischer
In that case, the problem is probably that GHC doesn't see the
realToFrac applications because they're too deeply wrapped in your
coordToCoord2D calls.

Okay, your compiler *does* rewrite realToFrac :: Double -> GLclampf to
double2Float#, at least when the situation is simple enough, although there
are no rewrite rules in the package for that.
Looks like a fortuitous bug.
But it doesn't do the rewriting in the real app, so it's probably indeed
too deeply wrapped there.

Post by Daniel Fischer
If that is the problem, it might help to use {-# INLINE #-} pragmas on
coordToCoord2D, fromCartesian2D and toCartesian2D.
Can you try with realToFrac and the {-# INLINE #-} pragmas?

I tried inlining the functions you suggest with little effect. The
realToFrac version (in this case I just set floatToFloat=realToFrac to
save the search and replace effort) is just too heavily loaded to see
any difference at all (98+% of CPU is spent in realToFrac). The same
inlining using my definition of floatToFloat gave me a 10% improvement
from 50% -> 46% of the CPU spent in floatToFloat and an inverse change
in allocation to match.
Best I can tell, the inlining is being recognized, but just not changing much.

Looking at the Coord stuff more closely, you'd probably need much more
inlining to get a good effect. And you probably need a bit more strictness
too.

============================================================

--Coord2D is a typeclass I created to hold 2D data
data Cartesian2D a = Cartesian2D a a deriving (Show, Eq, Read)

-- Needs testing, but I suspect
{-
data Cartesian2D a = Cartesian2D !a !a deriving (...)

or even

data Cartesian2D a = Cartesian2D {-# UNPACK #-} !a {-# UNPACK #-} !a
deriving (...)
-}
-- would have a beneficial effect.

{- Pair instances -}
instance (RealFloat a, RealFloat b) => Coord2D (a,b) where
  xComponent = realToFrac . fst
  yComponent = realToFrac . snd
  fromCartesian2D p = ((xComponent p),(yComponent p))

-- That might be too lazy, perhaps
{-
xComponent (x,_) = realToFrac x
yComponent (_,y) = realToFrac y
fromCartesian2D (Cartesian2D x y) = (x,y)
-}
-- will be better

-- anyhow, maybe you need to inline all methods of Coord2D to get the rules
to fire:

class Coord2D a where
{-# INLINE xComponent #-}
  xComponent :: (RealFloat b) => a -> b
{-# INLINE yComponent #-}
  yComponent :: (RealFloat b) => a -> b
{-# INLINE toCartesian2D #-}
  toCartesian2D :: (RealFloat b) => a -> Cartesian2D b
  toCartesian2D p = Cartesian2D (xComponent p) (yComponent p)
{-# INLINE fromCartesian2D #-}
  fromCartesian2D :: (RealFloat b) => Cartesian2D b -> a

-- I'm rather convinced inlining the component functions will be good, but
-- there's a good chance that they're small enough to be inlined anyway.

-- The inlining of the to/fromCratesian2D functions is doubtful, because

--and this function allows conversion between coordinate representations
coordToCoord2D :: (Coord2D a, Coord2D b) => a -> b
coordToCoord2D = fromCartesian2D . toCartesian2D

-- cries loudly for

{-# RULES
"toCart/fromCart" forall p. toCartesian2D (fromCartesian2D p) = p
#-}

-- whenever that's possible

-- so, perhaps first try to rewrite, whenever that's possible, afterwards
inline, hence

-- {-# INLINE [2] toCartesian2D #-}
-- {-# INLINE [2] fromCartesian2D #-}
-- {-# RULES
-- "toCart/fromCart" [~2] forall p. toCartesian (fromCartesian p) = p
-- #-}

-- dunno whether that works, but -ddump-simpl-stats should tell
============================================================

Finally, there's one other thing to try, with or without rules/inlining:

coordToVertex2 :: Coord2D a => a -> (GL.Vertex2 GL.GLclampf)
coordToVertex2 = coordToCoord2D

GLclampf is a newtype wrapper around a newtype wrapper around Float.
Coercing between newtype and original is supposed to be safe, so

import Unsafe.Coerce

floatToGLclampf :: Float -> GL.GLclampf
floatToGLclampf = unsafeCoerce

coordToVertex2 c =
case coordToCoord2D c of
(x,y) -> GL.Vertex2 (floatToGLclampf x) (floatToGLclampf y)

That way, we circumvent a potentially expensive call to
realToFrac :: a -> GLclampf
for a = Double or a = Float and split it into a no-op (unsafeCoerce) and a
hopefully cheap conversion to Float.

Post by Greg
And still ran faster than floatToFloat. However there's no denying
that floatToFloat runs *much* faster than realToFrac in the larger
application. Profiling shows floatToFloat taking about 50% of my CPU

That's too much for my liking, a simple conversion from Double to
Float shouldn't take long, even if the Float is wrapped in newtypes
(after all, the newtypes don't exist at runtime).

Code? Maybe you have to give a name for it to be cached.

Post by Greg
but I'm guessing it's because the floatToFloat is being
done in an OpenGL callback within the IO monad. Eventually I'll be
providing time-varying data anyway, so the conversions will have to be
continuously recalculated in the end.
That comes out to 65000 conversions every 30ms, or about 2 million
conversions a second. I'd probably just leave it at that except, as
you've demonstrated, there is at least a factor of 3 or 4 to be gained
somehow-- realToFrac can provide it under the right conditions.

Post by Greg
{-# RULES
"floatToFloat/id" floatToFloat=id
"floatToFloat x2" floatToFloat . floatToFloat = floatToFloat
#-}

I'm not sure how the rule-spotting works with compositions, whether it
matches `foo . bar' with `foo (bar x)' [one in the code, the other in the
rule], it might be necessary to give the rule in both forms.

Post by Greg
Neither of which seems to fires in this application,

GHC reports fired rules with -ddump-simpl-stats.
Getting rules to fire is a little brittle, GHC does not try too hard
to match expressions with rules, and if several rules match, it
chooses one arbitrarily, so your rules may have been missed because
the actual code looked different (perhaps because other rewrite rules
fired first).

Yeah, I've been looking at the -ddump-simp-stats output. If I'm reading
the documentation right, rules are enabled simply by invoking ghc with
-O or -O2, right?

Right, -O implies -fenable-rewrite-rules (and hence -O2 too).
On the other hand, you can't have rewrite-rules without -O [that is, you
can pass -fenable-rwerite-rules on the command line without -O, it will
just have no effect]. Presumably the flag exists for its negation, so you
can invoke GHC with -O -fno-enable-rewrite-rules to have the rules not
firing.

Post by Greg
I'm now not convinced any of my rewrite rules are
firing-- or at least I can't seem to get them to again.

If they fire, -ddump-simpl-stats tells you, there's a piece like

9 RuleFired
1 ==#->case
1 >#
1 eftInt
1 fold/build
1 fromIntegral/Int->Double
1 int2Float#
1 realToFrac/Double->Float
1 unpack
1 unpack-list

in the dump, if it contains the name of one of your rules, it fired n
times, otherwise it didn't fire.

Post by Greg
No, that doesn't do it. I tried a few variations on that and it always
chokes on the => symbol or whatever other syntax I try to use. The Num
constraint was added because it was needed on related functions (3
element vertices where the z was stuffed with 0, for example), so I got
rid of those and the Num constraint. Doesn't matter, the rule still
doesn't fire... =(

Might have been inlined before the rule got a chance to fire.

Post by Greg
Cheers--
Greg

Daniel Fischer

2010-09-20 14:53:25 UTC