Cluster intervals in a data frame based on chromosome, start and end.

genome_cluster(x, by = NULL, max_distance = 0,
  cluster_column_name = "cluster_id")

Arguments

x

A dataframe.

by

A character vector with three entries naming the chromosome, start and end columns, in that order. For example: by=c("chr", "start", "end")

max_distance

The maximum distance between two intervals on the same chromosome at which they are still assigned to the same cluster. Default: 0.

cluster_column_name

A string used as the name of the new column that holds the cluster ID. Default: "cluster_id".

Value

The input dataframe with one additional column containing the cluster ID of each row.

Examples

library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following object is masked from ‘package:testthat’:
#> 
#>     matches
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
x1 <- data.frame(id = 1:4, bla = letters[1:4],
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 120, 300, 260),
                 end = c(150, 250, 350, 450))
genome_cluster(x1, by=c("chromosome", "start", "end"))
#> # A tibble: 4 x 6
#>      id bla   chromosome start   end cluster_id
#>   <int> <fct> <fct>      <dbl> <dbl>      <dbl>
#> 1     1 a     chr1         100   150          0
#> 2     2 b     chr1         120   250          0
#> 3     3 c     chr2         300   350          2
#> 4     4 d     chr1         260   450          1
genome_cluster(x1, by=c("chromosome", "start", "end"), max_distance=10)
#> # A tibble: 4 x 6
#>      id bla   chromosome start   end cluster_id
#>   <int> <fct> <fct>      <dbl> <dbl>      <dbl>
#> 1     1 a     chr1         100   150          0
#> 2     2 b     chr1         120   250          0
#> 3     3 c     chr2         300   350          1
#> 4     4 d     chr1         260   450          0
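
To illustrate what max_distance does in the two calls above, here is a minimal base R sketch of one way such a clustering could work. It is not the package's implementation: cluster_sketch and demo are hypothetical names introduced only for this illustration, and the boundary rule (a new cluster starts when an interval begins more than max_distance past the furthest end seen so far on that chromosome) is an assumption chosen because it matches the example output. Treat the cluster IDs as arbitrary labels.

```r
# Hypothetical helper, not part of the package: assigns cluster IDs with base R only.
cluster_sketch <- function(x, by, max_distance = 0) {
  chr   <- as.character(x[[by[1]]])
  start <- x[[by[2]]]
  end   <- x[[by[3]]]

  # Visit intervals chromosome by chromosome, from left to right.
  ord <- order(chr, start)

  id       <- integer(nrow(x))
  current  <- -1L             # becomes 0 for the first cluster
  prev_chr <- NA_character_
  reach    <- -Inf            # furthest end position of the current cluster

  for (i in ord) {
    # Assumed rule: open a new cluster on a new chromosome, or when the gap
    # to the current cluster is larger than max_distance.
    if (is.na(prev_chr) || chr[i] != prev_chr || start[i] > reach + max_distance) {
      current <- current + 1L
      reach   <- end[i]
    } else {
      reach <- max(reach, end[i])
    }
    prev_chr <- chr[i]
    id[i]    <- current
  }

  x$cluster_id <- id
  x
}

# Same intervals as x1 in the examples above.
demo <- data.frame(id = 1:4,
                   chromosome = c("chr1", "chr1", "chr2", "chr1"),
                   start = c(100, 120, 300, 260),
                   end = c(150, 250, 350, 450))

cluster_sketch(demo, by = c("chromosome", "start", "end"))
cluster_sketch(demo, by = c("chromosome", "start", "end"), max_distance = 10)
```

With max_distance = 0, row 4 (start 260) begins past the end of row 2 (250) and gets its own cluster. With max_distance = 10, the gap of 10 is within the limit, so rows 1, 2 and 4 share one cluster.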