Clarification regarding Shingle size, number of samples per tree and threshold #365

@prathamk

Hello,

We are trying to understand the implementation of RRCF and have a few questions about tuning some of its hyper-parameters. Could you help us better understand the effects of the following hyper-parameters?

Shingle Size:
With the shingle size, we generally try to capture the shape of the curve that is expected to repeat periodically.
For example, if we have data every 15 min with a pattern repeating every 24 hrs, then the shingle size should be (60/15) * 24 = 96.
And if we have data every 5 min with a pattern repeating every 24 hrs, then the shingle size should be (60/5) * 24 = 288.
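
For reference, the arithmetic above can be written as a small Python sketch. The function name `shingle_size_for_period` is ours, for illustration only; it is not part of the RRCF library.

```python
def shingle_size_for_period(sample_interval_min: int, period_hours: int) -> int:
    """Number of consecutive samples that cover one full period.

    Hypothetical helper for illustration; not an RRCF API.
    """
    period_min = period_hours * 60
    if sample_interval_min <= 0 or period_min % sample_interval_min != 0:
        raise ValueError("period must be a whole multiple of the sampling interval")
    return period_min // sample_interval_min


print(shingle_size_for_period(15, 24))  # 96
print(shingle_size_for_period(5, 24))   # 288
```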

Is this the right way to decide on the shingle size?
Increasing the shingle size delays detection. Also, some short-lived peaks and dips are not detected with the default settings for threshold and anomaly rate. It seems we also need to decrease the lower threshold as we increase the shingle size. Is this assumption correct?

In the snapshot below, data is collected every 5 min, with shingle size = 288, number of trees = 100, sample size = 1024.
Plot 2
In the above plot, a shingle size of 288 captures the shape of the curve, but it seems to miss a lot of anomalies. The score increases but doesn't cross the lower threshold.

In the snapshot below, data is collected every 5 min, with shingle size = 16, number of trees = 100, sample size = 1024.
Plot 3
In the above plot, a shingle size of 16 was chosen arbitrarily (it doesn't capture the shape of the entire curve), but it seems to capture the anomalies better. The delay in identifying anomalies is also smaller.

Also, keeping such a high value for the shingle size increases the dimensionality of each point. Is there any limit on the size of the shingle?
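
To make the dimensionality concern concrete, here is a minimal pure-Python sketch of sliding-window shingling as we understand it. The `shingle` function and the synthetic series are our own illustration, not the library's internal implementation. Each shingled point has as many coordinates as the shingle size.

```python
from typing import List, Sequence


def shingle(series: Sequence[float], shingle_size: int) -> List[List[float]]:
    """Turn a 1-D series into overlapping windows of length shingle_size.

    Illustrative only; the RRCF library may shingle internally in a different way.
    """
    if shingle_size < 1:
        raise ValueError("shingle_size must be at least 1")
    n_points = len(series) - shingle_size + 1  # may be <= 0, giving an empty result
    return [list(series[i:i + shingle_size]) for i in range(n_points)]


# Synthetic series: 1024 readings with a pattern repeating every 288 samples.
series = [float(i % 288) for i in range(1024)]

for size in (16, 288):
    points = shingle(series, size)
    # Prints the shingle size, the number of shingled points, and their dimension.
    print(size, len(points), len(points[0]))
```

With a shingle size of 288, a single unusual reading is only one of 288 coordinates in each point that contains it, which may be part of why short peaks and dips move the score less than with a shingle size of 16.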

Number of samples:
Is there any recommendation on how many samples one should keep in each tree of the forest? Currently we are choosing the number of samples per tree arbitrarily.

As I understand it, increasing the sample size gives each tree a larger view of the data, but it also increases the running time and the size of the forest's state.

Lower Threshold:
The lower threshold is initialized to 1.1. How was this value chosen, and is it recommended to modify it? What factors should one consider when modifying the lower threshold?
