Cohesive clustering: overcomes the overlinking problem in clusterByThreshold.
With the naive approach used in clusterByThreshold, a cohesive cluster of tightly knit elements
can gain a new, lower-scoring rogue element that "bridges meaning" and weakly connects the
existing island to a different island of meaning. Visualized as a graph, the two dense clusters
joined by this single element form the shape of a dumbbell. Were it not for the weak bridge
connecting them, they would be entirely separate clusters.
For example, a single cluster might mix "ice" as in "US immigration control" with "ice" as in
snow and weather. A single bridging document (e.g. "ICE agents struggle through snow") might be
the only connection between two otherwise separate collections.
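The overlinking problem can be reproduced with a minimal sketch. Everything here is an
assumption made for illustration: naiveThresholdClusters is a hypothetical stand-in for
clusterByThreshold (its real signature is not shown in this text), and the similarity scores
are made up. Three immigration documents and three weather documents form one cluster as soon
as a single bridging document is present.

```typescript
// Hypothetical stand-in for clusterByThreshold (not the real implementation):
// link every pair with similarity >= threshold, return connected components.
function naiveThresholdClusters(sim: number[][], threshold: number): number[][] {
  const n = sim.length;
  const parent: number[] = [];
  for (let i = 0; i < n; i++) parent.push(i);
  const find = (x: number): number => {
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]]; // path halving
      x = parent[x];
    }
    return x;
  };
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      if (sim[i][j] >= threshold) parent[find(i)] = find(j);
    }
  }
  const byRoot: { [root: number]: number[] } = {};
  for (let i = 0; i < n; i++) {
    const root = find(i);
    if (!byRoot[root]) byRoot[root] = [];
    byRoot[root].push(i);
  }
  return Object.keys(byRoot)
    .map((k) => byRoot[Number(k)])
    .sort((a, b) => b.length - a.length);
}

// Made-up similarities: docs 0-2 are about immigration, 4-6 about weather,
// and doc 3 is the bridge ("ICE agents struggle through snow").
function dumbbellSimilarity(): number[][] {
  const topic = (i: number): string =>
    i < 3 ? "immigration" : i === 3 ? "bridge" : "weather";
  const sim: number[][] = [];
  for (let i = 0; i < 7; i++) {
    const row: number[] = [];
    for (let j = 0; j < 7; j++) {
      if (i === j) row.push(1);
      else if (topic(i) === topic(j)) row.push(0.9);
      else if (topic(i) === "bridge" || topic(j) === "bridge") row.push(0.55);
      else row.push(0.1);
    }
    sim.push(row);
  }
  return sim;
}

const simAll = dumbbellSimilarity();
// Same data with the bridging document (index 3) removed.
const simNoBridge = simAll
  .filter((_, i) => i !== 3)
  .map((row) => row.filter((_, j) => j !== 3));

const withBridge = naiveThresholdClusters(simAll, 0.5); // one cluster of all 7 docs
const withoutBridge = naiveThresholdClusters(simNoBridge, 0.5); // two clusters of 3 docs
console.log(withBridge.length, withoutBridge.length);
```

Raising the threshold above 0.55 would also split this toy example, but in real data a bridge
can score as high as legitimate members, which is why a structural check is needed.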
clusterWithCohesion still links elements using similarity thresholds, but then splits clusters
that lack cohesion because they contain large groups that are only weakly connected to each other.
It starts from the thresholded graph, then prunes "bridge" edges, which are found by
sampling shortest paths between random vertex pairs (an approximation of edge betweenness),
recomputes the connected components, and repeats for a few rounds.
Returns clusters as arrays of node indices, sorted by size.
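The pruning loop described above might look roughly like the following sketch. It is not the
actual clusterWithCohesion implementation: the function name cohesionPruneSketch, the edge-list
input, and the parameters rounds, samples, maxShare, minSize and seed are all assumptions made
for illustration, as is the rule of cutting any edge that carries more than maxShare of the
sampled paths. Largest-first ordering is also assumed.

```typescript
type Edge = [number, number];

// Tiny seeded LCG so the sampling is reproducible (illustrative only).
function makeRand(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Connected components via BFS over adjacency lists.
function components(adj: number[][]): number[][] {
  const seen: boolean[] = adj.map(() => false);
  const out: number[][] = [];
  for (let s = 0; s < adj.length; s++) {
    if (seen[s]) continue;
    seen[s] = true;
    const comp = [s];
    for (let k = 0; k < comp.length; k++) {
      for (const w of adj[comp[k]]) {
        if (!seen[w]) { seen[w] = true; comp.push(w); }
      }
    }
    out.push(comp);
  }
  return out;
}

// One BFS shortest path between s and t (must be in the same component),
// returned as a vertex list running from t back to s.
function shortestPath(adj: number[][], s: number, t: number): number[] {
  const prev: number[] = adj.map(() => -1);
  prev[s] = s;
  const queue = [s];
  for (let k = 0; k < queue.length && prev[t] === -1; k++) {
    for (const w of adj[queue[k]]) {
      if (prev[w] === -1) { prev[w] = queue[k]; queue.push(w); }
    }
  }
  const path = [t];
  while (path[path.length - 1] !== s) path.push(prev[path[path.length - 1]]);
  return path;
}

const edgeKey = (u: number, v: number): string => Math.min(u, v) + "-" + Math.max(u, v);

function cohesionPruneSketch(
  n: number, edges: Edge[],
  rounds = 3, samples = 400, maxShare = 0.3, minSize = 6, seed = 1
): number[][] {
  const rand = makeRand(seed);
  let adj: number[][] = [];
  for (let i = 0; i < n; i++) adj.push([]);
  for (const [u, v] of edges) { adj[u].push(v); adj[v].push(u); }

  for (let r = 0; r < rounds; r++) {
    const cut: { [key: string]: boolean } = {};
    let anyCut = false;
    for (const comp of components(adj)) {
      if (comp.length < minSize) continue; // assumption: leave small groups alone
      const load: { [key: string]: number } = {};
      for (let i = 0; i < samples; i++) {
        const s = comp[Math.floor(rand() * comp.length)];
        const t = comp[Math.floor(rand() * comp.length)];
        if (s === t) continue;
        const path = shortestPath(adj, s, t);
        for (let k = 1; k < path.length; k++) {
          const key = edgeKey(path[k - 1], path[k]);
          load[key] = (load[key] || 0) + 1;
        }
      }
      // Edges carrying a large share of sampled paths are treated as bridges.
      for (const key of Object.keys(load)) {
        if (load[key] / samples > maxShare) { cut[key] = true; anyCut = true; }
      }
    }
    if (!anyCut) break; // nothing left to prune
    adj = adj.map((nbrs, u) => nbrs.filter((v) => !cut[edgeKey(u, v)]));
  }
  return components(adj).sort((a, b) => b.length - a.length);
}

// Dumbbell: two 4-cliques (0-3 and 5-8) joined only through vertex 4.
const dumbbellEdges: Edge[] = [
  [0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3],
  [3, 4], [4, 5],
  [5, 6], [5, 7], [5, 8], [6, 7], [6, 8], [7, 8],
];
const prunedClusters = cohesionPruneSketch(9, dumbbellEdges);
console.log(prunedClusters);
```

In this sketch the bridging vertex ends up as its own singleton cluster once both of its edges
are cut. Whether the real function keeps, reassigns, or drops such elements is not stated here.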