Meet Ollie π¦¦
βHey there, my dear friends!β Iβm Ollie the Otter , a curious river-explorer and dataβscientistβinβtraining. One morning while gliding along my favorite stream, I spotted a dairy barn where hundreds of cows were visiting automatic feeders and drinkers. The beeping RFID readers and steady munch-munch-munching created a beautiful data symphony!
Each night, back at my mossy den, I curl up with my otterβsized laptop and learn about R. The friendly farmers have shared their raw visit logs, so I can uncover the hidden tales behind my cow friends. Iβm fascinated by how every single cow writes its own story in the dataβand tonight weβre going to read their stories together using the
moo4feed
package. Ready to splash in? Letβs go!
1. Why Clean Your Data? π§Ή
Raw data from automatic feeders contains numerous quality issues including negative values,inconsistent timestamps due to daylight saving changes, and erroneous visit records. Data cleaning transforms these raw records into analysis-ready datasets suitable for behavioral research.
2. Load Required Packages π§°
library(moo4feed) # Main package for animal data processing
library(ggplot2) # For visualization
library(dplyr) # For data manipulation
The moo4feed
package includes sample datasets from UBC
Dairy Education and Research Center, allowing you to follow this
tutorial without your own data.
3. Where Are Your Files? π
# For your own data, edit this line:
# extdata_path <- "/path/to/my/raw/files" # β your folder path
# For demonstration with sample data:
extdata_path <- system.file("extdata", package = "moo4feed")
4. Load in Data files π
We use filename patterns to distinguish feeder data from drinker data:
# Find all feeder files (VR*.DAT pattern)
fileNames.f <- list.files(
path = extdata_path, # π Where to look
pattern = "^VR.*\\.DAT$", # π Pattern to match for feeder files
recursive = TRUE, # Look in subfolders too
full.names = TRUE # Include full path
) |> sort() # Keep them in order
# Find all drinker files (VW*.DAT pattern)
fileNames.w <- list.files(
path = extdata_path, # π° Where to look
pattern = "^VW.*\\.DAT$", # π° Pattern to match for drinker files
recursive = TRUE,
full.names = TRUE
) |> sort()
File Naming Requirements: All filenames must contain a six-digit date (YYMMDD) sequence. The system extracts the first six-digit sequence as the date.
Different Filename Patterns: Change the
pattern
to match your file names. For example, if your files are namedFEED_20250715.csv
, you could usepattern = "^FEED_.*\\.csv$"
.
β Valid formats:
VR200306.DAT
,Feeder_data_200306.DAT
,wat_200306_01.DAT
β Invalid:
wat_01_200306.DAT
(leading digits β01β interfere with date extraction)
5. Handle Daylight Saving Time β°
Daylight saving time (DST) transitions can create timestamp inconsistencies. Identify DST change dates for your study period:
# Identify DST transitions for your data years
dst_df <- get_dst_switch_info(
years = c(2020, 2021), # π
EDIT THIS to match your data years
tz = "America/Chicago" # EDIT THIS to your timezone if needed
# `Sys.timezone()` uses the timezone your computer is located at.
)
print(dst_df)
#> # A tibble: 2 Γ 7
#> year spring fall spring_next_day fall_next_day spring_time
#> <dbl> <date> <date> <date> <date> <dttm>
#> 1 2020 2020-03-08 2020-11-01 2020-03-09 2020-11-02 2020-03-08 01:59:00
#> 2 2021 2021-03-14 2021-11-07 2021-03-15 2021-11-08 2021-03-14 01:59:00
#> # βΉ 1 more variable: fall_time <dttm>
6. Process Data Files π
Match Files by Date
To analyze only days with both feeder AND drinker data:
# Match feeder and drinker files by date
date_compare <- compare_files(fileNames.f, fileNames.w)
fileNames.f <- date_compare$feed # Updated feeder file list
fileNames.w <- date_compare$water # Updated drinker file list
cat("Filenames for your feeder data: \n\n", fileNames.f)
#> Filenames for your feeder data:
#>
#> /home/runner/work/_temp/Library/moo4feed/extdata/VR201031.DAT /home/runner/work/_temp/Library/moo4feed/extdata/VR201101.DAT
cat("\n\n")
cat("Filenames for your drinker data: \n\n", fileNames.w)
#> Filenames for your drinker data:
#>
#> /home/runner/work/_temp/Library/moo4feed/extdata/VW201031.DAT /home/runner/work/_temp/Library/moo4feed/extdata/VW201101.DAT
# Get the overall date range of your dataset
date_result <- get_date_range(fileNames.f)
cat("Dataset date range: ", date_result, "\n")
#> Dataset date range: 201031_201101
π Configure global settings for your data
There are two ways to set up global variables for your data:
Option 1: Configure all settings at once with
set_global_cols()
This is the simplest approach - set everything in one function call:
# Configure all global variables in one call
# π EDIT THIS to match your data file structure
set_global_cols(
# Time zone
tz = "America/Vancouver",
# Column names in your data files
id_col = "cow", # animal ID column
trans_col = "transponder", # transponder ID column
start_col = "start", # visit start time column
end_col = "end", # visit end time column
bin_col = "bin", # bin ID column
dur_col = "duration", # duration column
intake_col = "intake", # intake column
start_weight_col = "start_weight", # start weight column
end_weight_col = "end_weight", # end weight column
# Bin settings
bins_feed = 1:30, # feed bin IDs in your barn
bins_wat = 1:5, # water bin IDs in your barn
bin_offset = 100 # add this to water bin IDs to differentiate from feeder IDs
)
Option 2: Configure settings individually
Alternatively, you can set each variable separately:
# ββ Time zone ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
set_tz2("America/Vancouver") # governs the handling of timestamp data
# ββ Column names in the raw feeder/drinker files ββββββββββββββββββββββββββββββ
set_id_col2("cow") # animal ID
set_trans_col2("transponder") # ID of transponder
set_start_col2("start") # visit start time
set_end_col2("end") # visit end time
set_bin_col2("bin") # feed-bin identifier
# ββ Feed and water bin settings ββββββββββββββββββββββββββββββββββββββββββββββ
set_bins_feed2(1:30) # the ID of 30 feed bins present in the barn
set_bins_wat2(1:5) # the ID of 5 water bins present in the barn
set_bin_offset2(100) # add 100 to water bin ID to differentiate from feeder ID
You can verify any setting at any time with the corresponding getter functions:
cat("Timezone:", tz2(), "\n")
#> Timezone: America/Vancouver
cat("ID column:", id_col2(), "\n")
#> ID column: cow
cat("Feed bins:", paste(bins_feed2(), collapse = ", "), "\n")
#> Feed bins: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
Process Feeder Data
# Define your column names (if your files don't have headers)
# π EDIT THIS to match your data file structure
col_names <- c("transponder", "cow", "bin", "start", "end", "duration",
"start_weight", "end_weight", "comment", "intake", "intake2",
"X1", "X2", "X3", "X4", "x5")
# Invalid IDs to remove (test tags, etc.)
# π EDIT THIS to remove any test tags or invalid IDs in your data
drop_ids <- c(0, 1556, 5015, 1111, 1112, 1113, 1114)
# Columns to retain for analysis
# π EDIT THIS to keep only the columns you need for analysis
select_cols <- c("transponder", "cow", "bin", "start", "end",
"duration", "start_weight", "end_weight", "intake")
# Process all feeder files
all_fed <- process_all_feed(
files = fileNames.f,
col_names = col_names, # Column names for your data
drop_ids = drop_ids, # IDs to remove
select_cols = select_cols, # Columns to keep (from above)
sep = ",", # π EDIT if your files use a different separator (tab="\t")
header = FALSE, # π Change to TRUE if your files have header rows
adjust_dst = TRUE # Set to FALSE if you don't want DST corrections
)
str(all_fed)
#> List of 2
#> $ 2020-10-31:'data.frame': 3483 obs. of 10 variables:
#> ..$ transponder : int [1:3483] 12200060 11954040 12706601 11954040 12200060 12200060 12199974 12706601 12448407 12706601 ...
#> ..$ cow : int [1:3483] 5114 4070 7010 4070 5114 5114 5028 7010 6020 7010 ...
#> ..$ bin : int [1:3483] 13 24 29 23 17 15 12 30 5 30 ...
#> ..$ start : POSIXct[1:3483], format: "2020-10-31 00:03:24" "2020-10-31 00:05:32" ...
#> ..$ end : POSIXct[1:3483], format: "2020-10-31 00:04:01" "2020-10-31 00:09:25" ...
#> ..$ duration : int [1:3483] 37 233 272 130 513 6 509 307 180 343 ...
#> ..$ start_weight: num [1:3483] 18.7 8.6 11.9 9.2 14.1 23 32.6 15.1 18 13.9 ...
#> ..$ end_weight : num [1:3483] 19.4 7.4 10.9 8.4 12.3 23 32.6 13.9 17.4 12.6 ...
#> ..$ intake : num [1:3483] -0.7 1.2 1 0.8 1.8 0 0 1.2 0.6 1.3 ...
#> ..$ date : Date[1:3483], format: "2020-10-31" "2020-10-31" ...
#> $ 2020-11-01:'data.frame': 3646 obs. of 10 variables:
#> ..$ transponder : int [1:3646] 12706613 12706634 12448477 21268732 12706613 12448477 12706601 12448477 12706613 12706634 ...
#> ..$ cow : int [1:3646] 7022 7043 6090 6126 7022 6090 7010 6090 7022 7043 ...
#> ..$ bin : int [1:3646] 13 29 15 6 29 14 30 15 29 28 ...
#> ..$ start : POSIXct[1:3646], format: "2020-11-01 00:04:26" "2020-11-01 00:03:57" ...
#> ..$ end : POSIXct[1:3646], format: "2020-11-01 00:06:03" "2020-11-01 00:08:47" ...
#> ..$ duration : int [1:3646] 97 290 142 393 155 155 366 13 71 230 ...
#> ..$ start_weight: num [1:3646] 12 9.3 23.8 5 8.2 15.9 20.2 23.5 7.8 8.4 ...
#> ..$ end_weight : num [1:3646] 12.1 8.2 23.5 3.8 7.8 15.3 19.1 23.5 7.6 7.5 ...
#> ..$ intake : num [1:3646] -0.1 1.1 0.3 1.2 0.4 0.6 1.1 0 0.2 0.9 ...
#> ..$ date : Date[1:3646], format: "2020-11-01" "2020-11-01" ...
Geographic Limitation π: Currently this function only adjust DST automatically for North American time zones because we donβt have data from other parts of the world. If you are from a different timezone and are having trouble handling DST changes yourself, please contact the maintainer and Iβm happy to add support for your timezone!
Notice when we call the function process_all_feed()
, we
didnβt specify id_col
,
start_col
, tz
etcβtheyβre all filled in
automatically from the global settings you defined above. If you ever
need a one-off override (for example, a different time-zone just for
this call) you can still pass that argument directly:
process_all_feed(files = fileNames.f, tz = "UTC")
Process Water Data
# Column names (only needed because drinker files have no headers)
col_names_wat <- c(
"transponder", "cow", "bin", "start", "end",
"duration", "start_weight", "end_weight", "intake"
)
select_cols_wat <- c(
"transponder", "cow", "bin", "start", "end",
"duration", "start_weight", "end_weight", "intake"
)
# Process all water files
all_wat <- process_all_water(
files = fileNames.w,
col_names = col_names_wat,
drop_ids = drop_ids,
select_cols = select_cols_wat,
sep = ",",
header = FALSE,
adjust_dst = TRUE # Disable if you don't need DST correction
)
str(all_wat)
#> List of 2
#> $ 2020-10-31:'data.frame': 572 obs. of 10 variables:
#> ..$ transponder : int [1:572] 11954040 12448407 12448407 12200060 12706609 12706618 11953971 11953971 11953971 12448417 ...
#> ..$ cow : int [1:572] 4070 6020 6020 5114 7018 7027 4001 4001 4001 6030 ...
#> ..$ bin : num [1:572] 104 102 102 101 104 101 104 104 104 102 ...
#> ..$ start : POSIXct[1:572], format: "2020-10-31 00:03:26" "2020-10-31 00:10:49" ...
#> ..$ end : POSIXct[1:572], format: "2020-10-31 00:04:16" "2020-10-31 00:12:16" ...
#> ..$ duration : int [1:572] 50 87 44 50 79 69 14 19 22 103 ...
#> ..$ start_weight: num [1:572] 43.4 43.9 44 41.2 43.4 41.3 43.5 43.4 43.1 43.9 ...
#> ..$ end_weight : num [1:572] 26.9 24.3 34.1 36.4 30.4 26.4 43.5 43.1 43 27.9 ...
#> ..$ intake : num [1:572] 16.5 19.6 9.9 4.8 13 14.9 0 0.3 0.1 16 ...
#> ..$ date : Date[1:572], format: "2020-10-31" "2020-10-31" ...
#> $ 2020-11-01:'data.frame': 566 obs. of 10 variables:
#> ..$ transponder : int [1:566] 12448407 12706601 12706613 12200070 12199987 12706634 12448420 12448477 12448420 12448477 ...
#> ..$ cow : int [1:566] 6020 7010 7022 5124 5041 7043 6033 6090 6033 6090 ...
#> ..$ bin : num [1:566] 104 105 104 102 104 105 104 102 104 102 ...
#> ..$ start : POSIXct[1:566], format: "2020-11-01 00:03:19" "2020-11-01 00:03:39" ...
#> ..$ end : POSIXct[1:566], format: "2020-11-01 00:04:23" "2020-11-01 00:06:13" ...
#> ..$ duration : int [1:566] 64 154 120 307 322 60 33 96 67 92 ...
#> ..$ start_weight: num [1:566] 43.5 40.9 43.2 43.9 43.3 41 43.6 43.9 42.4 44 ...
#> ..$ end_weight : num [1:566] 40.2 22.4 28.5 38.2 22.6 26.8 42.4 33.5 35.6 39.8 ...
#> ..$ intake : num [1:566] 3.3 18.5 14.7 5.7 20.7 14.2 1.2 10.4 6.8 4.2 ...
#> ..$ date : Date[1:566], format: "2020-11-01" "2020-11-01" ...
7. Quality Check Your Data π΅οΈββοΈ
Quality control identifies and corrects common data issues including overlapping visits, impossible values, and missing records.
Configure Quality Check Parameters
# Define quality control thresholds
# π EDIT the thresholds to match your data and preferences
my_qc_config <- qc_config(
# Feed thresholds
high_dur_feed = 2000, # Flag feed visits longer than 2000 seconds
large_intake_visit_feed = 8, # Flag feed visits with more than 8kg intake
large_intake_rate_feed = 0.008, # Flag feed intake rates above 0.008 kg/s
low_feed_intake = 35, # Flag cows eating less than 35kg/day
high_feed_intake = 75, # Flag cows eating more than 75kg/day
# Water thresholds
high_dur_water = 1800, # Flag water visits longer than 1800 seconds
large_intake_visit_water = 30, # Flag water visits with more than 30L intake
large_intake_rate_water = 0.35, # Flag water intake rates above 0.35 L/s
low_wat_intake = 60, # Flag cows drinking less than 60L/day
high_wat_intake = 180, # Flag cows drinking more than 180L/day
# General settings
low_visit_threshold = 10, # Flag bins with fewer than 10 visits/day
total_cows_expected = 48, # Set to your herd size
replacement_threshold = 26, # Time gap (s) to classify replacement behavior
calibration_error = 0.5 # Feeder/drinkercalibration error (kg/L)
)
Run the quality check
# Run comprehensive quality check
qc_results <- qc(
feed = all_fed, # Your processed feeder data
water = all_wat, # Your processed drinker data
cfg = my_qc_config, # Your custom thresholds
verbose = FALSE, # Do not show detailed warning messages about errorous visits, set to TRUE to see them
fix_double_detections = TRUE # Automatically fix overlapping visits
)
# Store warnings for review
warning <- qc_results$warnings
The qc()
function performs multiple validation
steps:
- Combined feed and water data for analysis
- Checked if we have the expected number of cows each day
- Flagged double detections (when the same cow is detected at different bins at the same time) and fixed them
- Flagged negative values in duration, intake, or weight measurements
- Removed records with negative durations or intakes
- Identified abnormally large feed/water intakes
- Flags cows that did not feed or drink after noon (potentially lost ear tag or due to illness)
- Flagged bins with very low traffic or no visits at all (potentially bin malfunction)
8. Traditional vs.Β Advanced Outlier Detection π
Traditional 3-Standard Deviation Method
Historically, dairy research has used rigid statistical cutoffs to remove outliers. The traditional approach removes all data points beyond 3 standard deviations from the mean:
# Demonstrate traditional outlier removal on feed data
demo_feed <- do.call(rbind, qc_feed) # Combine all days into one dataframe
# Calculate traditional 3-SD cutoffs for intake
intake_mean <- mean(demo_feed$intake, na.rm = TRUE)
intake_sd <- sd(demo_feed$intake, na.rm = TRUE)
intake_lower <- max(0, (intake_mean - 3 * intake_sd)) # set lower bound to 0
intake_upper <- intake_mean + 3 * intake_sd
# Calculate traditional 3-SD cutoffs for duration
dur_mean <- mean(demo_feed$duration, na.rm = TRUE)
dur_sd <- sd(demo_feed$duration, na.rm = TRUE)
dur_lower <- max(0, (dur_mean - 3 * dur_sd)) # set lower bound to 0
dur_upper <- dur_mean + 3 * dur_sd
# Flag outliers using traditional method
demo_feed$outlier <- "N"
demo_feed$outlier[demo_feed$intake < intake_lower |
demo_feed$intake > intake_upper |
demo_feed$duration < dur_lower |
demo_feed$duration > dur_upper] <- "Y"
# Visualize traditional outlier detection
p_traditional <- viz_outliers(
demo_feed,
x_var = "intake",
y_var = "duration",
title = "Traditional Outlier Detection (3 SD from mean)",
jitter_amount = 0,
regular_color = "lightgreen",
outlier_color = "orange"
)
print(p_traditional)
Limitations of the Traditional 3-SD Method
The standard approach using a 3-standard-deviation (3-SD) cutoff presents several challenges:
- Overly strict thresholds: While effective at removing extreme outliers, this method imposes rigid, rectangular cutoff zones that may not align with biological rationales.
- Excessive conservatism: The approach may discard biologically reasonable extremes. Previous validation study showed that long feeding visit accompanied by high feed intake is possible, but they dare considered data errors using 3-SD method. In the meanwhile, itβs not likely for a cow to consume 3kg in 5 seconds, but those will not be labeled as outliers using 3-SD method.
- Misclassification of valid data: Many feeding events that appear normal in the context of cow behavior are incorrectly flagged as outliers simply because they fall outside the 3-SD range. Excluding these points can significantly distort estimates of individual animalβs daily intake and feeding duration.
- Misses multivariate relationships: The method evaluates each variable independently, failing to consider the relationship between intake, duration, and feeding rate.
- Ignores differences among farms and animals: Everyoneβs data is different. The method does not account for variations in feeding behavior across different farms or individual cows.
9. Advanced KNN Outlier Detection π¬
K-Nearest Neighbors (KNN) outlier detection evaluates data points in multidimensional space, considering relationships between intake, duration, and rate simultaneously. The KNN method looks at each data point and compares it to its neighbors. If a point is very different from its neighbors (in the 3-demensional space of intake, duration, and rate), it might be an outlier.
KNN Method Advantages
- Multidimensional analysis: Considers the intake, duration, and rate in one go
- Flexible boundaries: You can change the boundaries to fit YOUR data
- Relationship-aware: Preserves correlations between variables
# Apply KNN outlier detection
feed_with_outliers <- knn_clean_feed(
qc_feed, # Our cleaned feed data
k = 50, # Number of neighbors to consider
threshold_percentile = 99.3, # Threshold for outlier detection
# Custom scaling helps balance the influence of each variable
custom_scaling = list(
rate = 10000, # Scaling for intake rate
intake = 7, # Scaling for intake amount
duration = 0.03 # Scaling for duration
),
remove_outliers = FALSE # Set to TRUE if you wish to remove outliers
)
# Visualize KNN outlier detection
p_knn <- viz_outliers(
feed_with_outliers,
x_var = "rate",
y_var = "intake",
title = "Feed Outliers: Rate vs Intake",
jitter_amount = 0,
regular_color = "lightgreen",
outlier_color = "orange"
)
print(p_knn)
Ollieβs Tips for KNN Outlier Detection: π¦¦
- Experiment with settings: Try different values for
k
,threshold_percentile
, andcustom_scaling
to see what works best for your data- Visualize before removing: Always use
viz_outliers()
to check if the detected outliers make sense before removing them- Custom scaling matters: Adjusting the scaling values helps balance the influence of different variables and adjusting for variables at different scales:
- e.g., duration is in tens to hundreds of seconds, but rate is often < 1, so rate needs to be scaled up in order to be comparable to duration
- Increase scaling for variables that should have more influence
- Decrease scaling for variables that should have less influence
- Why these scaling values?: These values were chosen to be conservative and not remove too many data points. From previous experience, itβs likely for a cow to have very long durations per visit, but very unlikely to have large intake in a short time, so we flagged outliers to catch visits with large intake and high rate.
- Be cautious: Donβt set
remove_outliers = TRUE
until youβre confident the detection is working correctly
Important Note: The 2D visualization above only shows two dimensions (duration vs.Β intake), but KNN analysis operates in 3D space using: Intake (kg), Duration (seconds), and Rate (kg/s). Some points that appear normal in 2D may be outliers when considering all three dimensions.
Apply KNN to Water Data
# KNN outlier detection for water data
water_with_outliers <- knn_clean_water(
qc_water, # Our cleaned water data
k = 50, # Number of neighbors to consider
threshold_percentile = 99, # Threshold for outlier detection
# Water data often needs different scaling
custom_scaling = list(
rate = 2000, # Scaling for intake rate
intake = 7, # Scaling for intake amount
duration = 0.01 # Scaling for duration
),
remove_outliers = FALSE # Set to TRUE if you wish to remove outliers
)
# Let's also look at rate vs intake
p2 <- viz_outliers(
water_with_outliers,
x_var = "rate",
y_var = "intake",
title = "Water Outliers: Rate vs Intake",
jitter_amount = 0,
regular_color = "lightblue",
outlier_color = "orange"
)
print(p2)
Remove Outliers
Once youβre satisfied with the outlier detection, you can choose to remove them:
# Remove outliers from feed data
cleaned_feed_no_outliers <- knn_clean_feed(
qc_feed,
k = 50,
threshold_percentile = 99.3,
custom_scaling = list(rate = 10000, intake = 7, duration = 0.03),
remove_outliers = TRUE
)
# Remove outliers from water data
cleaned_water_no_outliers <- knn_clean_water(
qc_water,
k = 50,
threshold_percentile = 99.8,
custom_scaling = list(rate = 2000, intake = 1, duration = 0.01),
remove_outliers = TRUE
)
# Create final cleaned datasets
clean_feed <- cleaned_feed_no_outliers
clean_water <- cleaned_water_no_outliers
clean_comb <- combine_feed_water(clean_feed, clean_water)
10. Daily Intake Summaries π
# Generate daily intake summaries
daily_summary <- feed_water_summary(
feed = qc_feed, # Your cleaned feed data
water = qc_water, # Your cleaned water data
warn = warning, # Warning dataframe to update
cfg = my_qc_config # Quality check configuration
)
# Inspect the summary results
summary_df <- daily_summary$summary
head(summary_df)
#> # A tibble: 6 Γ 8
#> date cow feed_intake feed_duration feed_visits water_intake
#> <date> <int> <dbl> <dbl> <int> <dbl>
#> 1 2020-10-31 2074 60.9 11163 46 105.
#> 2 2020-10-31 3150 59.2 17618 52 116.
#> 3 2020-10-31 4001 53.3 8099 55 75
#> 4 2020-10-31 4044 63.4 12038 73 125.
#> 5 2020-10-31 4070 63.3 12530 59 123.
#> 6 2020-10-31 4072 54.7 13679 79 99.5
#> # βΉ 2 more variables: water_duration <dbl>, water_visits <int>
# Look at the updated warnings that include intake anomalies
warning <- daily_summary$warn
head(warning)
#> # A tibble: 2 Γ 16
#> date total_cows missing_cow double_detection_bins negative_visit_bins
#> <chr> <int> <chr> <chr> <chr>
#> 1 2020-10-31 47 Yes 30 6; 12; 13
#> 2 2020-11-01 47 Yes 3; 5; 6; 18 6; 12; 13; 30
#> # βΉ 11 more variables: cows_disappeared_after_noon <chr>,
#> # bins_never_visited <chr>, bins_low_traffic <chr>, long_dur_feeder <chr>,
#> # large_intake_feed_visit <chr>, low_daily_feed_intake_cows <chr>,
#> # high_daily_feed_intake_cows <chr>, long_dur_drinker <chr>,
#> # large_intake_water_visit <chr>, low_daily_water_intake_cows <chr>,
#> # high_daily_water_intake_cows <chr>
11. Summary
This tutorial demonstrated comprehensive data cleaning for automatic feeder data:
β File processing: Handled multiple data files with date matching
β Quality control: Identified and corrected common data issues
β Advanced outlier detection: Compared traditional vs.Β KNN methods
β Daily summaries: Created analysis-ready intake metrics
The cleaned datasets are now ready for behavioral analysis as demonstrated in subsequent tutorials.
12. Code cheatsheet π
# === 1. Load packages ===
library(moo4feed)
# === 2. Set paths to data files ===
# For your own data:
# extdata_path <- "/path/to/my/raw/files"
# Or for the demo data:
extdata_path <- system.file("extdata", package = "moo4feed")
# If feeders and drinkers are in different folders:
# extdata_path_f <- "/path/to/feeder/files"
# extdata_path_w <- "/path/to/drinker/files"
# === 3. Find data files ===
# Find all feeder files (VR*.DAT)
fileNames.f <- list.files(
path = extdata_path,
pattern = "^VR.*\\.DAT$",
recursive = TRUE,
full.names = TRUE
) |> sort()
# Find all drinker files (VW*.DAT)
fileNames.w <- list.files(
path = extdata_path,
pattern = "^VW.*\\.DAT$",
recursive = TRUE,
full.names = TRUE
) |> sort()
# === 4. Handle daylight saving time ===
dst_df <- get_dst_switch_info(years = c(2020, 2021), tz = "America/Chicago")
# === 5. Match files by date ===
date_compare <- compare_files(fileNames.f, fileNames.w)
fileNames.f <- date_compare$feed
fileNames.w <- date_compare$water
# Get the overall date range
date_result <- get_date_range(fileNames.f)
# === 6. Configure global settings ===
# Option 1: All at once
set_global_cols(
# Time zone
tz = "America/Vancouver",
# Column names in your data files
id_col = "cow",
trans_col = "transponder",
start_col = "start",
end_col = "end",
bin_col = "bin",
dur_col = "duration",
intake_col = "intake",
start_weight_col = "start_weight",
end_weight_col = "end_weight",
# Bin settings
bins_feed = 1:30,
bins_wat = 1:5,
bin_offset = 100
)
# Option 2: Individual settings (alternative)
# set_tz2("America/Vancouver")
# set_id_col2("cow")
# set_trans_col2("transponder")
# set_start_col2("start")
# set_end_col2("end")
# set_bin_col2("bin")
# set_bins_feed2(1:30)
# set_bins_wat2(1:5)
# set_bin_offset2(100)
# === 7. Process feeder data ===
# Define column names for feeder files
col_names <- c("transponder", "cow", "bin", "start", "end", "duration",
"start_weight", "end_weight", "comment", "intake", "intake2",
"X1", "X2", "X3", "X4", "x5")
# IDs to remove
drop_ids <- c(0, 1556, 5015, 1111, 1112, 1113, 1114)
# Columns to keep
select_cols <- c("transponder", "cow", "bin", "start", "end",
"duration", "start_weight", "end_weight", "intake")
# Process all feeder files
all_fed <- process_all_feed(
files = fileNames.f,
col_names = col_names,
drop_ids = drop_ids,
select_cols = select_cols,
sep = ",",
header = FALSE,
adjust_dst = TRUE
)
# === 8. Process water data ===
# Column names for water files
col_names_wat <- c(
"transponder", "cow", "bin", "start", "end",
"duration", "start_weight", "end_weight", "intake"
)
# Use same IDs to remove
drop_ids <- c(0, 1556, 5015, 1111:1114)
# Columns to keep
select_cols_wat <- c(
"transponder", "cow", "bin", "start", "end",
"duration", "start_weight", "end_weight", "intake"
)
# Process all water files
all_wat <- process_all_water(
files = fileNames.w,
col_names = col_names_wat,
drop_ids = drop_ids,
select_cols = select_cols_wat,
sep = ",",
header = FALSE,
adjust_dst = TRUE
)
# === 9. Quality check configuration ===
my_qc_config <- qc_config(
# Feed thresholds
high_dur_feed = 2000,
large_intake_visit_feed = 8,
large_intake_rate_feed = 0.008,
low_feed_intake = 35,
high_feed_intake = 75,
# Water thresholds
high_dur_water = 1800,
large_intake_visit_water = 30,
large_intake_rate_water = 0.35,
low_wat_intake = 60,
high_wat_intake = 180,
# General settings
low_visit_threshold = 10,
total_cows_expected = 48,
replacement_threshold = 26,
calibration_error = 0.5
)
# === 10. Run quality check ===
qc_results <- qc(
feed = all_fed,
water = all_wat,
cfg = my_qc_config,
id_col = id_col2(), # Animal ID column (default from global vars)
start_col = start_col2(), # Visit start time column (default from global vars)
end_col = end_col2(), # Visit end time column (default from global vars)
bin_col = bin_col2(), # Bin/feeder ID column (default from global vars)
dur_col = duration_col2(), # Visit duration column (default from global vars)
intake_col = intake_col2(), # Intake amount column (default from global vars)
start_weight_col = start_weight_col2(), # Start weight column (default from global vars)
end_weight_col = end_weight_col2(), # End weight column (default from global vars)
tz = tz2(), # Timezone (default from global vars)
bins_feed = bins_feed2(), # Valid feed bin numbers (default from global vars)
bins_wat = bins_wat2(), # Valid water bin numbers (default from global vars)
bin_offset = bin_offset2(), # Bin offset for water bins (default from global vars)
verbose = FALSE,
fix_double_detections = TRUE
)
# Store warnings
warning <- qc_results$warnings
# === 11. Extract cleaned data ===
qc_feed <- qc_results$feed
qc_water <- qc_results$water
qc_combined <- qc_results$combined
# === 12. KNN outlier detection ===
# Detect outliers in feed data
feed_with_outliers <- knn_clean_feed(
qc_feed,
k = 50,
threshold_percentile = 99.3,
custom_scaling = list(
rate = 10000,
intake = 7,
duration = 0.03
),
intake_col = intake_col2(), # Intake amount column (default from global vars)
duration_col = duration_col2(), # Visit duration column (default from global vars)
date_col = "date", # Date column name
remove_outliers = FALSE
)
# Visualize feed outliers
p1 <- viz_outliers(
feed_with_outliers,
x_var = "duration",
y_var = "intake",
title = "Feed Outliers: Duration vs Intake",
jitter_amount = 0.2,
regular_color = "lightgreen",
outlier_color = "orange"
)
print(p1)
p2 <- viz_outliers(
feed_with_outliers,
x_var = "rate",
y_var = "intake",
title = "Feed Outliers: Rate vs Intake",
jitter_amount = 0,
regular_color = "lightgreen",
outlier_color = "orange"
)
print(p2)
# Detect outliers in water data
water_with_outliers <- knn_clean_water(
qc_water,
k = 50,
threshold_percentile = 99,
custom_scaling = list(
rate = 2000,
intake = 7,
duration = 0.01
),
intake_col = intake_col2(), # Intake amount column (default from global vars)
duration_col = duration_col2(), # Visit duration column (default from global vars)
date_col = "date", # Date column name
remove_outliers = FALSE
)
# Visualize water outliers
p3 <- viz_outliers(
water_with_outliers,
x_var = "duration",
y_var = "intake",
title = "Water Outliers: Duration vs Intake",
regular_color = "lightblue",
outlier_color = "orange"
)
print(p3)
p4 <- viz_outliers(
water_with_outliers,
x_var = "rate",
y_var = "intake",
title = "Water Outliers: Rate vs Intake",
jitter_amount = 0,
regular_color = "lightblue",
outlier_color = "orange"
)
print(p4)
# Remove outliers
cleaned_feed_no_outliers <- knn_clean_feed(
qc_feed,
k = 50,
threshold_percentile = 99.3,
custom_scaling = list(rate = 10000, intake = 7, duration = 0.03),
intake_col = intake_col2(), # Intake amount column (default from global vars)
duration_col = duration_col2(), # Visit duration column (default from global vars)
date_col = "date", # Date column name
remove_outliers = TRUE
)
# Same for water data
cleaned_water_no_outliers <- knn_clean_water(
qc_water,
k = 50,
threshold_percentile = 99,
custom_scaling = list(rate = 2000, intake = 7, duration = 0.01),
intake_col = intake_col2(), # Intake amount column (default from global vars)
duration_col = duration_col2(), # Visit duration column (default from global vars)
date_col = "date", # Date column name
remove_outliers = TRUE
)
# Set final cleaned datasets
clean_feed <- cleaned_feed_no_outliers
clean_water <- cleaned_water_no_outliers
clean_comb <- combine_feed_water(clean_feed, clean_water)
# === 13. Create daily summaries ===
daily_summary <- feed_water_summary(
feed = qc_feed,
water = qc_water,
warn = warning,
cfg = my_qc_config,
id_col = id_col2(), # Animal ID column (default from global vars)
intake_col = intake_col2(), # Intake amount column (default from global vars)
dur_col = duration_col2() # Visit duration column (default from global vars)
)
# Get summary dataframe
summary_df <- daily_summary$summary
# Get updated warnings
warning <- daily_summary$warn