
Detect outliers using k-nearest neighbors (KNN) method
Source: R/knn_outliers.R
knn_outlier_detection.Rd
This function identifies outliers in feeding or drinking data using the k-nearest neighbors (KNN) algorithm. It rests on the idea that an outlier lies farther, on average, from its k nearest neighbors than a typical observation does.
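The idea can be sketched in base R. This is a minimal illustration, not the package's implementation: `avg_knn_distance` is a hypothetical helper, the data are simulated, and the Euclidean metric is an assumption.

```r
# Hypothetical helper: mean distance from each row to its k nearest neighbors.
avg_knn_distance <- function(x, k) {
  d <- as.matrix(dist(x))   # pairwise Euclidean distances (assumed metric)
  diag(d) <- NA             # a point is not its own neighbor
  # sort() drops the NA, so the first k values are the k nearest neighbors
  apply(d, 1, function(row) mean(sort(row)[seq_len(k)]))
}

set.seed(1)
# 50 simulated inliers around the origin plus one distant point in row 51
pts <- rbind(matrix(rnorm(100), ncol = 2), c(10, 10))

score  <- avg_knn_distance(pts, k = 5)
cutoff <- quantile(score, probs = 0.99)   # 99th percentile of the scores
flag   <- ifelse(score > cutoff, "Y", "N")
which(flag == "Y")
```

Only the distant point in row 51 scores above the cutoff here. Lowering the percentile flags more rows.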
Usage
knn_outlier_detection(
  df,
  k = 50,
  threshold_percentile = 99,
  custom_scaling = NULL,
  intake_col = intake_col2(),
  duration_col = duration_col2(),
  remove_outliers = FALSE
)
Arguments
- df
A data frame containing feeding or drinking data.
- k
Integer. Number of nearest neighbors to consider (default: 50). It is adjusted automatically if it exceeds the number of rows in the data.
- threshold_percentile
Numeric. Percentile threshold for outlier detection, between 0 and 100 (default: 99). Points whose average distance to their neighbors exceeds this percentile are flagged as outliers.
- custom_scaling
A named list with scaling factors for input variables. Default is NULL, which means no scaling is applied (all factors = 1).
- intake_col
Character. Name of the column containing intake data (default: intake_col2(), from global_var.R).
- duration_col
Character. Name of the column containing duration data (default: duration_col2(), from global_var.R).
- remove_outliers
Logical. Whether to remove outliers from the data frame (default: FALSE).
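The sketch below shows one way the `k` adjustment and `custom_scaling` could behave. It rests on assumptions: `prepare_knn_input` is a hypothetical helper, scaling factors are taken to multiply the matching columns, and `k` is taken to be capped at one less than the number of rows. The package's internal rules may differ.

```r
# Hypothetical helper; not part of the package.
prepare_knn_input <- function(df, k, scaling = NULL) {
  vars <- names(df)
  factors <- setNames(rep(1, length(vars)), vars)   # NULL means all factors = 1
  if (!is.null(scaling)) factors[names(scaling)] <- unlist(scaling)
  scaled <- sweep(as.matrix(df), 2, factors[vars], `*`)  # multiply column-wise
  k_used <- min(k, nrow(df) - 1)                    # assumed capping rule
  list(x = scaled, k = k_used)
}

# Toy data with made-up values
toy  <- data.frame(duration = c(600, 1200, 1800), intake = c(1.5, 3.0, 5.2))
prep <- prepare_knn_input(toy, k = 50, scaling = list(duration = 0.001))
prep$k   # capped, because the toy data has only three rows
prep$x   # duration rescaled, intake unchanged
```

Rescaling matters because distance-based methods are sensitive to units: a duration in seconds would otherwise dominate an intake in kilograms.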
Value
A data frame with the same structure as the input, plus an additional column 'outlier' indicating whether each row is an outlier ("Y") or not ("N"). If remove_outliers = TRUE, the function instead returns the data frame with the outlier rows removed and the 'outlier' column dropped.
Examples
# Detect outliers in feeding data
cleaned_feed <- knn_outlier_detection(all_fed[[1]], threshold_percentile = 99.936)
#> Warning: NAs or Inf values detected in rate calculation. These rows will be automatically marked as outliers.
cleaned_feed[which(cleaned_feed$outlier == "Y"), ]
#>      transponder  cow bin               start                 end duration
#> 748     12200070 5124   1 2020-10-31 07:36:44 2020-10-31 08:06:42     1798
#> 1517    11724250 3150  14 2020-10-31 11:21:38 2020-10-31 11:51:05     1767
#> 1761    12706613 7022  26 2020-10-31 13:12:53 2020-10-31 13:12:53        0
#> 3367    12200069 5123  14 2020-10-31 22:12:23 2020-10-31 22:41:59     1776
#>      start_weight end_weight intake       date        rate outlier
#> 748          30.7       25.5    5.2 2020-10-31 0.002892102       Y
#> 1517         14.4        9.0    5.4 2020-10-31 0.003056027       Y
#> 1761         26.8       26.9   -0.1 2020-10-31        -Inf       Y
#> 3367         34.1       27.6    6.5 2020-10-31 0.003659910       Y
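The rate values in this output are consistent with intake divided by duration. That is an inference from the printed numbers, not a documented formula. Under that assumption, the short sketch below reproduces two of the rows and shows why the zero-duration visit in row 1761 triggers the warning.

```r
# Values copied from rows 748 and 1761 of the output above.
visits <- data.frame(
  row      = c(748, 1761),
  intake   = c(5.2, -0.1),
  duration = c(1798, 0)
)

# Assumed formula: rate = intake / duration
visits$rate <- visits$intake / visits$duration

# Division by a zero duration gives a non-finite rate
visits$non_finite <- !is.finite(visits$rate)
visits
```

A negative intake over a zero-second visit gives -Inf, which is the kind of value the warning says is marked as an outlier automatically.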