As explained, the clustering method used in package
evprof is Gaussian Mixture Models clustering. This method
is sensible to outliers since it tries to explain as most as possible
all the variance of the data, which results to wide and non-specific
Gaussian distributions (clusters). Therefore evprof
package provides different functions to detect and filter outliers. At
the same time, it is also recommended to perform the clustering process
in a logarithmic scale, to include negative values to originally
positive variables. The logarithmic transformation can be done in most
of functions, setting the log argument to
TRUE.
Here we have a set of sessions of example,
noisy_set:
noisy_set <- sessions_divided %>% # Obtained from the "Get started" article
filter(Disconnection == "Home", Timecycle == "Friday") # Friday Home
plot_points(noisy_set, size = 0.2)
We set the start parameter at 06:00:
options(
evprof.start.hour = 6
)And we can plot it in logarithmic scale to visualize the areas of the plot where outliers are:
plot_points(noisy_set, size = 0.2, log = T)
Cutting sessions
If we see a part of the graph that consists clearly of outlying
points, then we can cut directly the sessions below or above this
specific limit using the function cut_sessions(). This
function lets to configure the Connection Duration limits
(connection_hours_min and
connection_hours_max) and the Connection Start limits
(connection_start_min and
connection_start_max). If we want to make the division in
logarithmic scale it is important to set the argument
log = TRUE.
noisy_set <- noisy_set %>%
cut_sessions(connection_hours_min = 1.5, connection_start_min = 2.5, log = T)
plot_points(noisy_set, size = 0.2, log = T)
Noise cleaning with DBSCAN clustering
The DBSCAN (Density-based spatial clustering of
applications with noise) clustering method is widely used for dividing
data sets according to density zones. In this case, this method has been
used to detect the outliers, i.e the data points outside of the main
density zones. Package evprof proposes the function
detect_outliers with the purpose of classify a certain
noise threshold of noise. The main arguments of this function
are MinPts, eps (DBSCAN parameters) and
noise_th (noise threshold, in %). The function
detect_outliers allows you to configure just the
MinPts and noise_th to automatically find the
eps value.
Usually values around of MinPts = 200 and
noise_th = 2 are recommended, but you could configure an
iteration to find the best combination according to the plot obtained
with function plot_outliers. First, let’s create a table
with all combinations of MinPts and noise_th
values you want:
.MinPts <- c(10, 50, 100, 200)
.noise_th <- c(1, 3, 5, 7)
dbscan_params <- tibble(
MinPts = rep(.MinPts, each = length(.noise_th)),
noise_th = rep(.noise_th, times = length(.MinPts))
)
print(dbscan_params)Now let’s run the iteration to create a plot for every combination
(using purrr::pmap function):
plots_list <- pmap(
dbscan_params,
~ noisy_set %>%
detect_outliers(MinPts = ..1, noise_th = ..2, log = T) %>%
plot_outliers(log = T, size = 0.2) +
theme(legend.position = "none")
)You can save the plots in a pdf for a proper visualization, using
cowplot::plot_grid function.
ggsave(
filename = 'my_noise_detection.pdf',
plot = cowplot::plot_grid(
plotlist = plots_list, nrow = 4, ncol = 4, labels = as.list(rep(.MinPts, each = length(.noise_th)))
),
width = 500, height = 250, units = "mm"
)From all these plots, we see that the the higher the
MinPts is, the more center-focused is the final clean
cluster. This is not a valid approach for all data sets, so the value of
MinPts must be defined properly in every case. In this
case, we decide that a good compromise solution is a value of
MinPts of 200 and a noise threshold of
5%:
plots_list[[15]]
