DMTN-043: Redesign of afw::math::Statistics

  • Pim Schellart

Latest Revision: 2017-04-14

Note

This document captures the redesign work of the lsst::afw::math::Statistics functionality.

Warning

This is an initial draft. A partially working proof of concept implementation, which sketches out the general design and has been benchmarked to be equivalent to the current Statistics performance on 4k x 4k noise images, is available at:

https://github.com/lsst/afw/u/pschella/DM-9979

1   Third party packages

ndarray: ndarray (ndarray).

2   LSST packages used

afw: lsst.afw.math (afwMath). afw: lsst.afw.image (afwImage).

3   Requirements

3.1   Critical requirements From LDM-151 (chapter 9.7)

  1. Compute various robust statistics for central tendency and distribution widths, measured on 2-d and 1-d arrays.
  2. Needs to be able to make use of mask and uncertainty arrays.
  3. Needs to work on 2-D Images and MaskedImages.
  4. Needs to work on stacks of aligned pixels for coaddition.

3.2   Keep from current implementation

  1. The implementation will be in C++.
  2. Equivalent performance (but some micro optimizations may be dropped).
  3. For efficiency multiple statistics will be computed together in a single pass wherever possible.
  4. Enable statistics to be calculated for afw::image::Image<T>, afw::image::MaskedImage<T> and std::vector<T> objects.
  5. Support all currently supported statistics. I.e.
    • number of sample points
    • sample mean
    • sample standard deviation
    • sample variance
    • sample median
    • sample inter-quartile range
    • sample N-sigma clipped mean
    • sample N-sigma clipped stdev
    • sample N-sigma clipped variance
    • sample minimum
    • sample maximum
    • find sum of pixels in the image
    • find mean value of square of pixel values
    • or-mask of all pixels used
    • errors of requested quantities
  6. Support statistics of afw::image::Mask<T>.
  7. Allow specification with bitmask which pixels to include / exclude in statistic.

3.3   Additional functionality in new implementation

  1. Change to ndarray::Array (instead of afw::image::Image<T> for type to operate on).
  2. Configuration using pex::config.
  3. Allow for adding new statistics to calculate (at compile time, not (necessarily) run time), and enable current statistics to be calculated differently.
  4. Add ability to see which pixels were clipped in output (e.g. as bitmask or as list of indices).
  5. Ensure design is compatible with afw::detection::SpanSet::applyFunctor.

3.4   Out of scope

  1. Adding new statistics.

4   Interfaces

All statistics requests are handled through the afw::math::computeStatistics function (which roughly mimics scipy.stats.describe). In Python its signature is as follows:

computeStatistics(img,
                   msk = None,
                   wgt = None,
                   var = None,
                   errors = False,
                   npoint = False,
                   mean = False,
                   stdev = False,
                   variance = False,
                   median = False,
                   iqrange = False,
                   meanclip = False,
                   stdevclip = False,
                   varianceclip = False,
                   min = False,
                   max = False,
                   sum = False,
                   meansquare = False,
                   ormask = False,
                   sctrl = StatisticsControl())

Where img, msk, wgt and var are typically a 1 dimensional numpy.ndarray, or any of the other supported overloads (see below).

In C++ the signature is:

template <typename ImageT, typename MaskT, typename WeightT, typename VarianceT>
StatisticsResult computeStatistics(ImageT const &img,
                          MaskT const *msk,     // (nullptr if not used)
                          WeightT const *wgt,   // (nullptr if not used)
                          VarianceT const *var, // (nullptr if not used)
                          int const flags,
                          StatisticsControl const &sctrl = StatisticsControl());

Where img, msk, wgt and var are typically ndarray::Array<T, 1, 1> of the appropriate type T, and where flags is a bitwise combination of:

enum Property {
    NOTHING = 0x0,         ///< We don't want anything
    ERRORS = 0x1,          ///< Include errors of requested quantities
    NPOINT = 0x2,          ///< number of sample points
    MEAN = 0x4,            ///< estimate sample mean
    STDEV = 0x8,           ///< estimate sample standard deviation
    VARIANCE = 0x10,       ///< estimate sample variance
    MEDIAN = 0x20,         ///< estimate sample median
    IQRANGE = 0x40,        ///< estimate sample inter-quartile range
    MEANCLIP = 0x80,       ///< estimate sample N-sigma clipped mean (N set in StatisticsControl, default=3)
    STDEVCLIP = 0x100,     ///< estimate sample N-sigma clipped stdev (N set in StatisticsControl, default=3)
    VARIANCECLIP = 0x200,  ///< estimate sample N-sigma clipped variance
                           ///<  (N set in StatisticsControl, default=3)
    MIN = 0x400,           ///< estimate sample minimum
    MAX = 0x800,           ///< estimate sample maximum
    SUM = 0x1000,          ///< find sum of pixels in the image
    MEANSQUARE = 0x2000,   ///< find mean value of square of pixel values
    ORMASK = 0x4000        ///< get the or-mask of all pixels used.
};

This is more natural in C++ for multiple options. An alternative would be to absorb the requested properties into StatisticsControl.

4.1   Overloads

The following overloads are present (also available from Python with the same syntax as above).

  • Handle image::Image arguments with an optional image::Mask.
template <typename ImagePixelT, typename MaskPixelT, typename WeightPixelT>
StatisticsResult computeStatistics(lsst::afw::image::Image<ImagePixelT> const &img,
                          lsst::afw::image::Mask<MaskPixelT> const *msk,
                          lsst::afw::image::Image<WeightPixelT> const *wgt,
                          int const flags,
                          StatisticsControl const &sctrl = StatisticsControl());
  • Handle image::MaskedImage.
template <typename ImagePixelT, typename WeightPixelT>
StatisticsResult computeStatistics(image::MaskedImage<ImagePixelT> const &mimg,
                          lsst::afw::image::Image<WeightPixelT> const *wgt,
                          int const flags,
                          StatisticsControl const &sctrl = StatisticsControl());
  • Handle image::Mask.
template <typename MaskPixelT>
StatisticsResult computeStatistics(image::Mask<MaskPixelT> const &msk,
                          int const flags,
                          statisticscontrol const &sctrl = statisticscontrol());

These overloads mostly just pass through the respective ndarrays obtained with .getArray().

4.2   Configuration

Statistics to compute are selected either with keyword arguments (Python) or a bitmask (C++). Additional configuration settings are provided through an (optional) StatisticsControl object using standard functionality provided by pex_config.

class NewStatisticsControl {
public:
    NewStatisticsControl()
            : numSigmaClip(3),
              numIter(3),
              andMask(0x0),
              noGoodPixelsMask(0x0),
              isNanSafe(true),
              calcErrorFromInputVariance(true),
              baseCaseSize(100),
              maskPropagationThresholds(16) {}

    LSST_CONTROL_FIELD(numSigmaClip, double, "Number of standard deviations to clip at");
    LSST_CONTROL_FIELD(numIter, int, "Number of iterations");
    LSST_CONTROL_FIELD(andMask, typename image::MaskPixel, "and-Mask to specify which mask planes to ignore");
    LSST_CONTROL_FIELD(noGoodPixelsMask, typename image::MaskPixel, "mask to set if no values are acceptable");
    LSST_CONTROL_FIELD(isNanSafe, bool, "Check for NaNs & Infs before running (slower)");
    LSST_CONTROL_FIELD(calcErrorFromInputVariance, bool,
                       "Calculate errors from the input variances, if available");
    LSST_CONTROL_FIELD(baseCaseSize, int, "Size of base case in partial sum for numerical stability");
    LSST_CONTROL_FIELD(
            maskPropagationThresholds, std::vector<double>,
            "Thresholds for when to propagate mask bits, treated like a dict (unset bits are set to 1.0)");
};

4.3   Results

Results are returned in the form of a single StatisticsResult object.

class StatisticsResult {
public:
    size_t getNpoint() const;
    std::pair<double, double> getMean() const;
    std::pair<double, double> getStdev() const;
    std::pair<double, double> getVariance() const;
    std::pair<double, double> getMedian() const;
    std::pair<double, double> getIqrange() const;
    std::pair<double, double> getMeanClip() const;
    std::pair<double, double> getStdevClip() const;
    std::pair<double, double> getVarianceClip() const;
    double getMin() const;
    double getMax() const;
    double getSum() const;
    typename image::MaskPixel getOrMask() const;
};

Note

Non-computed results are indicated with NaN.

Note

In Python all getters are also available as properties.

5   Implementation

The above requirements lead to the following design guidelines:

  1. Use compile-time polymorphism (through templated types) exclusively (at least in the inner loop) for abstraction, configurability and speed.
  2. Translate runtime options to compile-time constants (such that branches in the inner loop are compiled away) as much as possible.
  3. Enable pairwise summation for numerical stability.
  4. Separate accumulation of intermediate products from calculating of final values (to enable interoperability with afw::detection::SpanSet::applyFunctor).

5.1   Design overview

The general design uses a Strategy like structure but with compile-time (instead of run-time) polymorphism. It has the following components:

  • Algorithm: calculates (collect) and combines (combineWith) intermediate values and produces the final StatisticsResult (reduce). Relies on external state (ExternalData) for nested runs. Only StandardStatistics is implemented, but can be swapped out later.
  • Runner<Algorithm, Validator>: performs (operator()) a single pass run of the algorithm on the data. Holds a (unique) instance of ExternalData. Only SinglePassStatistics is implemented, but can be swapped out later.
  • Validator: checks if pixels should be included or excluded. Are composable with various simple building blocks available.

5.2   Internal API

5.2.1   computeStatistics

The externally visible computeStatistics function calls an internal detail::translateOptions, which has a series of overloads that reduce runtime function arguments to compile-time template parameters. At the end of this series this results in a call to detail::computeStatistics.

namespace detail {

template <bool useMask, bool useWeight, bool useVariance, bool computeRange, bool computeMedian,
          bool sigmaClipped, typename ImageT, typename MaskT, typename WeightT, typename VarianceT>
StatisticsResult computeStatistics(ImageT const &img, MaskT const &msk, WeightT const &wgt,
                          VarianceT const &var, NewStatisticsControl const &sctrl);

}  // namespace detail

At this point one run (or multiple runs in case of sigma clipping) of SinglePassStatistics are performed and the results are returned.

Note

Currently the code assumes StandardStatistics as the underlying algorithm, but this can be changed to a template parameter if needed.

5.2.2   SinglePassStatistics

The default runner is SinglePassStatistics. This implementation uses pairwise summation (combination) down to a baseCaseSize after which it performs a simple linear run through the data. Internally it keeps one instance of Algorithm::ExternalData which is passed to the Algorithm::collect and Algorithm::reduce functions (by reference) where needed.

It takes the following template parameters:

  • Validator: A functor that takes image, mask, weight and variance arguments and returns true if the value is to be clipped, or false otherwise.
  • Algorithm: The statistics algorithm to execute.

Constructors:

  • explicit SinglePassStatistics(const Validator &validator = Validator(), size_t baseCaseSize = 100) Copies of both input parameters are held as private members.
  • default copy and move constructors (as well as copy and move assignment operators).

Member functions and operators:

  • operator(): Run Algorithm once on provided img, msk, wgt and var using pairwise summation / combination.
template <typename ImageT, typename MaskT, typename WeightT, typename VarianceT,
          typename... AlgorithmArgs>
StatisticsResult operator()(ImageT const &img, MaskT const &msk, WeightT const &wgt,
                  VarianceT const &var);

5.2.3   StandardStatistics

The default algorithm is represented by the StandardStatistics class which has the following API

Member type:

  • ExternalData : Algorithm specific data that has to be stored across different instances of StandardStatistics, to be provided by the runner.

Template Parameters:

  • useMask, bool : Use the provided mask (assume msk is UnusedParameter if not)
  • useWeight, bool : Use the provided weights (assume wgt is UnusedParameter if not)
  • useVariance, bool : Use the provided variance (assume var is UnusedParameter if not)
  • computeRange, bool : Compute min and max values (can be split if desired)
  • computeMedian, bool : Compute the median (requires ExternalData to store a copy of good pixel values)

Constructors:

  • Default constructor, default copy and move constructors (as well as copy and move assignment operators).

Member functions:

  • collect: Gather intermediate products, can be run multiple times to accumulate one or more values.

    template <typename ImageIter, typename MaskIter, typename WeightIter, typename VarianceIter,
              typename Validator>
    void collect(ImageIter img, MaskIter msk, WeightIter wgt, VarianceIter var, size_t n,
                 Validator const &validator, ExternalData &externalData);
    
  • reduce: Compute intermediate results (e.g. mean, median, or variance, but not stdev which is computed lazily from the variance)

    StatisticsResult reduce(ExternalData &externalData);
    
  • combineWith: Combine intermediate products.

    void combineWith(const StandardStatistics &rhs);
    

5.2.4   Validators

During a run of Algorithm::collect each set of img, msk, wgt and var values is passed to the provided Validator functor. If its return value is false the pixel is clipped.

Validators are composable. The makeCombinedChecker variadic function creates a single validator out of its provided arguments.

It does this by nesting instances of CheckBoth which itself takes two template arguments First and Second and has an operator() that returns the logical and of First::operator() and Second::operator().

The following building blocks are provided:

  • AlwaysTrue: always returns true;
  • AlwaysFalse: always returns false;
  • CheckMask: returns bitwise and of msk value with mask argument provided at construction;
  • CheckFinite: returns true if img is finite (e.g. not NaN or infinity);
  • CheckRange: returns true if img is within the interval center +/- limit where center and limit are provided at construction.

5.2.5   Unused parameters

Unused parameters (for msk, wgt or var) are given as nullptr (C++) or None (Python) in the API, but for lower layers are given as instances of UnusedParameter. A fake vector / iterator that does nothing. This makes the inner loop in Algorithm::collect straightforward while compiling away into non-existence..