Discussion:
[ANN] CalipeL: a benchmarking tool for Smalltalk/X and Pharo
Jan Vrany
2015-10-23 08:47:41 UTC
Hi there,

After more than 2 years of on-and-off development and about as
much time of use, I'd like to announce CalipeL, a tool for
benchmarking and monitoring performance regressions.

The basic ideas that drove the development:

* Benchmarking and (especially) interpreting benchmark results
  is always a bit of monkey business. The tool should produce raw
  numbers, letting the user apply whichever statistics she needs
  to make up the (desired) results.
* Benchmark results should be kept and managed in a single place so
  one can view and retrieve all past benchmark results pretty much
  the same way as one can view and retrieve past versions of
  the software from a source code management tool.

Features:

- simple - creating a benchmark is as simple as writing a method
  in a class (see the little example after this list)
- flexible - special set-up and/or warm-up routines can be
  specified at the benchmark level, as well as a set of parameters
  to allow fine-grained measurements under different conditions
- batch runner - contains a batch runner that allows one to run
  benchmarks from the command line or on CI servers such as Jenkins.
- web - comes with a simple web interface to gather and process
  benchmark results. However, the web application deserves
  some more work.
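
To give a flavour, here is a minimal sketch of such a method. Only
the <benchmark> annotation (described later in this thread) is
CalipeL's actual syntax; the class, selector and body are made up
for illustration:

  MyBenchmarks >> benchmarkGlobalLookup
      "Made-up example: the <benchmark> pragma marks the method as
       a benchmark; everything else here is purely illustrative."
      <benchmark>
      1000 timesRepeat: [ Smalltalk at: #Object ]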

Repository:

  https://bitbucket.org/janvrany/jv-calipel

  http://smalltalkhub.com/#!/~JanVrany/CalipeL-S (a read-only export
  of the above, plus Pharo-specific code)

More information:

  https://bitbucket.org/janvrany/jv-calipel/wiki/Home

I have been using CalipeL for benchmarking and keeping track of
the performance of the Smalltalk/X VM, STX:LIBJAVA, a PetitParser
compiler and other code I have been working on over time.

Finally, I'd like to thank Marcel Hlopko for his work on the 
web application and Jan Kurs for his comments.

I hope some of you may find it useful. If you have any comments 
or questions, do not hesitate to let me know!

Regards, Jan
Max Leske
2015-11-01 11:11:33 UTC
Hi Jan,

That looks pretty cool!
We use SMark (http://smalltalkhub.com/#!/~PharoExtras/SMark) for benchmarking and CI integration for Fuel. If you know SMark, could you give me an idea of what the differences are?

Cheers,
Max
Jan Vrany
2015-11-01 22:45:18 UTC
Hi Max,

I looked at some version of SMark years ago and never used 
it extensively, so I might be wrong, but: 

* The SMark executor does some magic with numbers. It tries to
  calculate the number of iterations to run in order to get
  "statistically meaningful results". Maybe it's just me, but
  I could not fully understand what it does and why it does it
  that way.
  CalipeL does no magic - it gives you raw numbers (no average,
  no mean, rather a sequence of measurements). It's up to whoever
  processes and interprets the data to use whatever method she
  likes (and whichever gives the numbers she'd like to see :-).
  This transparency was important for our needs.

* SMark, IIRC, requires benchmarks to inherit from some base class
  (like SUnit). Also, I'm not sure if SMark allows you to specify a
  warm-up phase (handy, for example, to measure peak performance
  when caches are filled and so on).
  CalipeL, OTOH, uses method annotations to describe the benchmark,
  so one can turn a regular SUnit test method into a benchmark simply
  by annotating it with <benchmark>. A warm-up method and
  setup/teardown methods can be specified per benchmark.

* SMark has no support for parametrization.
  In CalipeL, support for benchmark parameters was one of the
  requirements from the very beginning. A little example:
  I had to optimize the performance of the Object>>perform: family
  of methods because they were thought to be slowish. I came up with
  several variants of a "better" implementation, not knowing which
  one was the best. How does each of them behave under different
  workloads? Like - how does the number of distinct receiver classes
  affect the performance? How does the number of distinct selectors
  affect the performance? Is the performance different when receiver
  classes are distributed uniformly or normally (which seems to be
  the more common case)? Same for selectors? Is a 256-row, 2-way
  associative cache better than a 128-row, 4-way associative one?
  You have a number of parameters; for each parameter you define
  a set of values, and CalipeL works out all possible combinations
  and runs the benchmark with each. Without parametrization, the
  number of benchmark methods would grow exponentially (just four
  parameters with four values each already means 256 hand-written
  methods), making it hard to experiment with different setups.
  For me, this is one of the key things.

* SMark measures time only.
  CalipeL measures time, too, but has the facility to provide a
  user-defined "measurement instrument", which can be anything
  (anything that can be measured, that is). For example, for some
  web application the execution time might not be that useful;
  perhaps the number of SQL queries it makes is more important.
  No problem, define your own measurement instrument and tell
  CalipeL to use it in addition to time, the number of GCs, you
  name it. All results of all instruments are part of the
  machine-readable report, of course. (A tiny sketch of the idea
  follows after this list.)

* SMark had no support for "system" profilers and similar.
  CalipeL integrates with systemtap/dtrace and cachegrind so one
  can have a full profile, including VM code, and see things like
  L1/L2 I/D cache misses and mispredicted branches, or count events
  like context switches, monitor signalling and context evacuation.
  Useful only for VM engineers, I think, but I cannot imagine doing
  my work without this. It is available only for Smalltalk/X, but it
  should not be a big deal to add this to Pharo (a simple plugin
  would do it, IMO).

* Finally, SMark spits out a report and that's it.
  CalipeL, OTOH, goes beyond that. It tries to provide tools
  to gather, store and query results in a centralised way so
  nothing is forgotten.
  (No more: hmm, where are the results of the #perform: benchmarks
  I ran three months ago? Is it this file? Or that file? Or did I
  delete them when my laptop ran out of disk space?)
  And yes, I know that in this area there's a lot of room for
  improvement. What we have now is certainly not ideal, to put
  it mildly :-)
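
  To illustrate just the idea of a user-defined instrument mentioned
  above: the class and selectors below are invented for this sketch,
  they are NOT CalipeL's actual instrument API.

    Object subclass: #SQLQueryCountInstrument
        instanceVariableNames: 'count'
        classVariableNames: ''
        package: 'MyApp-Benchmarks'

    SQLQueryCountInstrument >> reset
        "Invented selector: clear the counter before a benchmark run
         (the application under test would increment it per query)."
        count := 0

    SQLQueryCountInstrument >> queries
        "Invented selector: answer the number of SQL queries issued
         during the run, to be recorded next to the timing data."
        ^ count ifNil: [ 0 ]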


Hope that gives you the idea. 

Jan
Max Leske
2015-11-05 15:46:45 UTC
Thanks Jan! That was quite thorough. I’ll have to take a look at CalipeL sometime. Sure sounds great :)

Cheers,
Max
s***@stefan-marr.de
2015-11-05 16:17:09 UTC
Hi Jan,
Hi Max:

I guess the main issue is missing documentation…
Even so, there are class comments…
Post by Jan Vrany
Hi Max,
I looked at some version of SMark years ago and never used
* SMark executor does some magic with numbers.
Nope. It only does that if you ask for it. Granted, though, that's the default setting, because it is supposed to be convenient to use from within the image.

The SMark design knows the concepts of reporter (how and what data to report), runner (how to execute benchmarks), suite (the benchmarks), and timer (which should probably be named gauge or something; it can measure anything, it doesn't have to be time).
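
For example, a minimal suite looks roughly like this (the bench-
prefix convention and the body are from memory, so treat it as a
sketch rather than as gospel):

  SMarkSuite subclass: #MySuite
      instanceVariableNames: ''
      classVariableNames: ''
      package: 'MyApp-Benchmarks'

  MySuite >> benchSum
      "Methods whose selectors start with 'bench' are picked up as
       benchmarks (my recollection of the convention); the body is
       just some work to measure."
      ^ (1 to: 10000) inject: 0 into: [:sum :each | sum + each]
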
Post by Jan Vrany
It tries to
calculate the number of iterations to run in order to get
"statistically meaningful results". Maybe it's just me, but
I could not fully understand what it does and why it does it
that way.
CalipeL does no magic - it gives you raw numbers (no average,
no mean, rather a sequence of measurements).
See the ReBenchHarness; it gives you exactly that as an alternative standard setting.
Post by Jan Vrany
* SMark, IIRC, requires benchmarks to inherit from some base class
(like SUnit).
"Require" is a strong word; as long as you implement the interface of SMarkSuite, you can inherit from wherever you want. It's Smalltalk after all.
Post by Jan Vrany
Also, I'm not sure if SMark allows you to specify a warm-up
phase (handy, for example, to measure peak performance when caches are
filled and so on).
There is the concept of #setup/teardown methods.
And a runner can do whatever it wants/needs to achieve warm-up, too.
For instance, the SMarkCogRunner will make sure that all code is compiled before it starts measuring.
Post by Jan Vrany
CalipeL, OTOH, uses method annotations to describe the benchmark,
so one can turn a regular SUnit test method into a benchmark simply
by annotating it with <benchmark>.
Ok, that’s not possible.
Post by Jan Vrany
A warmup method and setup/teardown
methods can be specified per-benchmark.
We got that too.
Post by Jan Vrany
* SMark has no support for parametrization.
Well, there is the #problemSize parameter, but that is indeed rather simplistic.
Post by Jan Vrany
* SMark measures time only.
Nope, an SMarkTimer can measure whatever it wants (and it even has a class comment ;))
Post by Jan Vrany
* SMark had no support for "system" profilers and similar.
That’s absent, true.
Post by Jan Vrany
* Finally, SMark spits out a report and that’s it.
Well, reports and raw data. I use ReBench [1] and pipe the raw data directly into my LaTeX/knitr/R tool chain to generate the graphs/numbers in my papers (see for example sec. 4 of [2], which is based on a LaTeX file with embedded R).

So, I'd say there are some interesting differences.
But much of what you mention seems to just come down to missing ‘documentation’/communication ;)

Best regards
Stefan

[1] https://github.com/smarr/ReBench
[2] http://stefan-marr.de/papers/oopsla-marr-ducasse-meta-tracing-vs-partial-evaluation/
Max Leske
2015-11-05 17:20:38 UTC
Thanks, Stefan, for the follow-up.
Jan Vrany
2015-11-05 20:38:14 UTC
Hi Stefan, 

OK, you proved I was wrong. I did say I might be :-)
Thanks for the clarification!

Jan