#29 closed defect (fixed)
psArrayStats() questions
| Reported by: | | Owned by: | eugene |
|---|---|---|---|
| Priority: | high | Milestone: | |
| Component: | PSLib ADD | Version: | unspecified |
| Severity: | normal | Keywords: | |
| Cc: | | | |
Description
1) The ADD for the robust statistics states that we should "fit a Gaussian" to
the bins of the smoothed robust histogram. Do you have a preference for which
algorithm we should use to make the fit?
2) The ADD also states that the sample LQ and sample UQ should be used in
determining the bin size. These quantities require a full sort of the data,
which may be inefficient for large data sets (as you know). Should we use
robust quartile points if the data set is large?
3) What is the algorithm for determining robust quartile points?
Change History (4)
comment:1 by , 22 years ago
comment:2 by , 22 years ago
| Component: | SDRS - PSLib → ADD |
|---|---|
| Status: | new → assigned |
1) fit a gaussian: the ADD gives some guidance on doing the fitting with
Levenberg-Marquardt, and psMinimize / psMinimizeChi are supposed to help with
that. The descriptions of those functions in the ADD and SDRS may still be a
bit unclear.
2) large samples / sample stats: the ADD says that the sample median and
quartiles should be avoided for large datasets. What should psArrayStats
do if these are requested for a large dataset? Perhaps it should just return
the robust values instead in the elements such as .sampleMedian. For rigor, it
would be nice if the parameter Nlarge could be set dynamically, perhaps via a
value in the structure?
3) robust quartile points should be found analogously to the robust median: fit
the points in the cumulative histogram in the vicinity of 25% and 75% and
interpolate. I'll add it to the ADD.
4) Hmm, a good question. What is meaningful? One option would be the bin in
which the median is found, but that would not give a good idea of how many
points were relevant. I would say to use the value N_75 - N_25 from the
cumulative histogram, but that histogram is not always constructed. TBD
5) cumulative histogram should be constructed from the un-smoothed version
6) here, and in the 25/75 points, the three bins are: the bin which contains the
point (median/UQ/LQ), N, and its two neighbors, N-1 & N+1
I will add these points to the ADD and SDRS and try to decide on 4 and the issue
in 2.
comment:3 by , 22 years ago
1) Okay.
2) Hmmm. Here was my interpretation: when the user requests SAMPLE quartiles,
the data vector is always sorted and the exact quartile points are calculated.
When the user requests ROBUST quartiles, the robust methods for calculating
them are used. The statement "sample quartiles should be avoided for samples
which are large (N > 10^4) ..." was interpreted, by me, to mean that the user
should avoid requesting SAMPLE quartiles with large data, though
psArrayStats() will gladly calculate them if requested.
Maybe we should only have one set of PS_LQ|PS_UQ and let the software decide
which is the appropriate method to use.
- I'm still unclear. When you say "bin", what numerical value do you associate
with it? The lower endpoint, the upper endpoint, or the midpoint?
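The SAMPLE-quartile path described above (always sort, then read off the exact quantile) can be sketched as follows; the interpolation convention between order statistics is one common choice, not necessarily the one psArrayStats uses:

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Exact sample quantile by full sort: the O(n log n) path the ticket says
 * should be avoided for large samples.  Linearly interpolates between order
 * statistics.  Copies the input so the caller's array stays unsorted.
 * (Illustrative sketch, not the PSLib implementation.) */
static double sample_quantile(const double *x, int n, double q)
{
    double *tmp = malloc(n * sizeof *tmp);
    memcpy(tmp, x, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);

    double pos = q * (n - 1);          /* fractional order statistic */
    int lo = (int)pos;
    double frac = pos - lo;
    double v = (lo + 1 < n) ? tmp[lo] + frac * (tmp[lo + 1] - tmp[lo])
                            : tmp[lo];
    free(tmp);
    return v;
}
```

The full sort is what makes this path expensive for large N, which is exactly why the ROBUST path works from the cumulative histogram instead.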
comment:4 by , 22 years ago
| Resolution: | → fixed |
|---|---|
| Status: | assigned → closed |
regarding point (4), I have changed the list of returned values to remove the
three entries (robustMeanNvalue, robustMedianNvalue, robustModeNvalue) and have
replaced them with robustN50 and robustNfit, which should always be returned when
robust stats are calculated. The first gives the number of points in the range
LQ - UQ. The second gives the number of points in the bins used to calculate the
Gaussian fit.
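As a sketch of what robustN50 counts, here is a direct O(n) pass over the data; the library presumably reads the same number off the cumulative histogram instead (names are illustrative, not the PSLib implementation):

```c
#include <assert.h>

/* Number of data points inside [lq, uq], as robustN50 is described: the
 * count of values between the lower and upper quartile.  A cumulative
 * histogram version would simply take cum(UQ) - cum(LQ). */
static int count_in_range(const double *x, int n, double lq, double uq)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (x[i] >= lq && x[i] <= uq)
            count++;
    return count;
}
```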
regarding point (2), when do we choose to use the robust stats vs the sample
stats? psStats now has an additional entry, .sampleLimit and psStatsOptions has
an additional entry PS_STATS_ROBUST_FOR_SAMPLE. If the number of data points in
the input vector is larger than .sampleLimit, the sample stats should be
calculated using the robust algorithm and the PS_STATS_ROBUST_FOR_SAMPLE bit
should be set. The default value for .sampleLimit should be 3e5. This is now
discussed in the SDRS.
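The .sampleLimit rule above can be sketched as a small decision helper. The names mirror the ticket (PS_STATS_ROBUST_FOR_SAMPLE, sampleLimit); the structure layout and the bit value are assumptions for illustration, and the real psStats structure is more elaborate:

```c
#include <assert.h>

#define PS_STATS_ROBUST_FOR_SAMPLE 0x1  /* illustrative bit value */

typedef struct {
    long sampleLimit;   /* default 3e5 per the ticket */
    unsigned flags;     /* option bits */
} psStatsSketch;

/* Returns 1 if the robust algorithm should stand in for the sample stats
 * (input larger than .sampleLimit), setting the flag bit so callers can
 * tell which method was actually used. */
static int choose_robust_for_sample(psStatsSketch *s, long npoints)
{
    if (npoints > s->sampleLimit) {
        s->flags |= PS_STATS_ROBUST_FOR_SAMPLE;
        return 1;
    }
    return 0;
}
```

Recording the choice in a flag bit, rather than silently substituting, lets a caller who asked for sample stats discover that the robust algorithm was used.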
also regarding point (2), how do we choose the bin size? I have made some
adjustments to the ADD to clarify this. Basically, I have changed it to use the
clipped stdev rather than the range UQ-LQ to get the choice of bin size. I also
extended the range of dL.

4) How are the robust mean|mode|median nvalues defined?
5) In calculating the robust median, the ADD states "construct a cumulative
histogram from the histogram above". Does "histogram above" refer to the
original histogram that was built from the data set, or to that histogram
after it has been Gaussian smoothed?
6) The ADD states "Fit a quadratic to these three points". What three points
are you referring to? Can I assume you mean the midpoints of the bins
surrounding the 50th percentile value (i.e. (bin_lower_bound + bin_upper_bound)/2)?