Imagine a data frame like the one below, in which several columns share the same tag (here, the gene name before the underscore). How can we combine the columns that share a tag?
### binary_var_dat
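| | EGFR_T790M | KRAS_K201S | BRAF_K220Q | NOTCH_N22K | EGFR_V600E | EGFR_Q333S | EGFR_F222Q | KRAS_F22X |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| S1_xxxxxx | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| S2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| S2_xxxxxx | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| S3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| S3_xxxxxx | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| S4 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| S4_xxxxxx | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| S5 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| S5_xxxxxx | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| S6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| S6_xxxxxx | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |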
### pre_gene_mat
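| | EGFR | KRAS | BRAF | NOTCH | EGFR | EGFR | EGFR | KRAS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| S1_xxxxxx | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| S2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| S2_xxxxxx | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| S3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| S3_xxxxxx | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| S4 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| S4_xxxxxx | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| S5 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| S5_xxxxxx | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| S6 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| S6_xxxxxx | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |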
### gene_mat
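| | EGFR | KRAS | BRAF | NOTCH |
| --- | --- | --- | --- | --- |
| S1 | 1 | 1 | 1 | 0 |
| S1_xxxxxx | 2 | 1 | 0 | 0 |
| S2 | 0 | 0 | 1 | 0 |
| S2_xxxxxx | 3 | 0 | 1 | 0 |
| S3 | 0 | 2 | 1 | 0 |
| S3_xxxxxx | 2 | 0 | 1 | 0 |
| S4 | 1 | 1 | 0 | 1 |
| S4_xxxxxx | 1 | 2 | 1 | 0 |
| S5 | 1 | 0 | 1 | 0 |
| S5_xxxxxx | 4 | 1 | 0 | 0 |
| S6 | 0 | 1 | 1 | 0 |
| S6_xxxxxx | 2 | 1 | 1 | 0 |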
More specifically, how can we convert binary_var_dat into gene_mat?
Here is my solution:
- Split each column name at its delimiter (the underscore);
- Replace each column name with the first piece of the split, i.e. the gene name;
- Merge the columns that now share a column name by summing them.
pre_gene_mat <- binary_var_dat
colnames(pre_gene_mat) <- sapply(strsplit(colnames(pre_gene_mat), split = "_"), '[', 1)
pre_gene_mat <- as.matrix(pre_gene_mat)
coln <- colnames(pre_gene_mat)
gene_mat <- pre_gene_mat %*% sapply(unique(coln),"==", coln)
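To make the matrix-product trick easier to follow, here is a self-contained sketch that rebuilds the example input and spells out the intermediate indicator matrix. The names `vals`, `genes`, `indicator` and `gene_mat_alt` are my own and purely illustrative. The `rowsum()` variant at the end is an alternative I am assuming gives the same result; it is not part of the solution above.

```r
# Rebuild the example input (values copied from the binary_var_dat table).
vals <- c(
  0, 0, 1, 0, 1, 0, 0, 1,
  0, 0, 0, 0, 1, 0, 1, 1,
  0, 0, 1, 0, 0, 0, 0, 0,
  1, 0, 1, 0, 1, 0, 1, 0,
  0, 1, 1, 0, 0, 0, 0, 1,
  0, 0, 1, 0, 1, 0, 1, 0,
  0, 0, 0, 1, 1, 0, 0, 1,
  0, 1, 1, 0, 0, 0, 1, 1,
  0, 0, 1, 0, 1, 0, 0, 0,
  1, 0, 0, 0, 1, 1, 1, 1,
  0, 0, 1, 0, 0, 0, 0, 1,
  0, 0, 1, 0, 1, 0, 1, 1
)
binary_var_dat <- as.data.frame(matrix(
  vals, nrow = 12, byrow = TRUE,
  dimnames = list(
    paste0("S", rep(1:6, each = 2), c("", "_xxxxxx")),
    c("EGFR_T790M", "KRAS_K201S", "BRAF_K220Q", "NOTCH_N22K",
      "EGFR_V600E", "EGFR_Q333S", "EGFR_F222Q", "KRAS_F22X")
  )
))

# Gene name of every column: the text before the first underscore.
genes <- sub("_.*$", "", colnames(binary_var_dat))

# Indicator matrix: one row per original column, one column per gene.
# An entry is TRUE when that original column belongs to that gene.
indicator <- sapply(unique(genes), "==", genes)

# Multiplying by the indicator matrix sums the columns of each gene.
gene_mat <- as.matrix(binary_var_dat) %*% indicator

# Alternative sketch: rowsum() sums rows by group, so transpose,
# sum by gene, and transpose back. reorder = FALSE keeps first-seen order.
gene_mat_alt <- t(rowsum(t(as.matrix(binary_var_dat)),
                         group = genes, reorder = FALSE))

print(gene_mat)
```

Each column of `indicator` picks out the variant columns of one gene, so the product adds exactly those columns together for every sample. If the matrix product feels too clever, the `rowsum()` form states the grouping more directly.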
To be honest, this problem cost me almost half an hour. It took so long because I had assumed that both the row names and the column names of a data frame in R must be unique. Surprisingly, the naming rules are more flexible than that: row names must be unique, but column names may be duplicated.
If you ever need duplicated row names, just transpose the data frame and carry on from there. I like this design philosophy.
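The naming behaviour is easy to check for yourself. The snippet below is a small sketch using only base R; the toy data frame `df` and the variable names are made up for illustration.

```r
df <- data.frame(a = 1:2, b = 3:4)

# Assigning duplicated column names to an existing data frame is accepted.
colnames(df) <- c("x", "x")
has_dup_cols <- anyDuplicated(colnames(df)) > 0

# The data.frame() constructor itself repairs duplicates by default,
# unless check.names = FALSE is passed.
repaired <- names(data.frame(x = 1, x = 2))
kept <- names(data.frame(x = 1, x = 2, check.names = FALSE))

# Assigning duplicated row names to a data frame is rejected with an error.
row_result <- tryCatch(
  suppressWarnings({
    rownames(df) <- c("r", "r")
    "accepted"
  }),
  error = function(e) "rejected"
)

# t() on a data frame returns a matrix, and a matrix is happy to carry
# the duplicated column names over as duplicated row names.
m <- t(df)

print(has_dup_cols)
print(row_result)
print(rownames(m))
```

One thing to keep in mind: the transposed object is a matrix, not a data frame, so all of its values are coerced to a single type.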