One way, using awk:
$ awk '{a[$2 OFS $3]+=$1}END{for(k in a)print a[k],k}' file
1 Chr1 100821817
1 Chr1 100821818
1 Chr1 100820415
5 Chr1 100824427
1 Chr1 100824428
1 Chr1 100823536
One way, using Python:
$ cat cluster.py
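#!/usr/bin/env python3
import fileinput

# Map "col2 col3" -> running sum of column 1.
cluster = {}
for line in fileinput.input():
    field = line.strip().split()
    key = ' '.join(field[1:])
    try:
        cluster[key] += int(field[0])
    except KeyError:
        # First time this key is seen: create the entry.
        cluster[key] = int(field[0])

# Output order follows the dict's iteration order,
# which may differ between Python versions.
for key, value in cluster.items():
    print(value, key)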
Make the script executable with chmod +x cluster.py and run it as follows:
$ ./cluster.py file
1 Chr1 100823536
1 Chr1 100821817
1 Chr1 100820415
5 Chr1 100824427
1 Chr1 100824428
1 Chr1 100821818
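Both methods use the same technique here: a hash table. In awk we use an associative array, and in Python a dictionary. Put simply, both are arrays whose keys are strings rather than numbers (here, the values of the second and third columns). A simple example: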
blue 1
blue 2
red 5
blue 1
red 2
If we run awk '{a[$1]+=$2}' file, the array a is built up as follows:
Line   Array        Value   Explanation
1      a["blue"]    1       # Entry in 'a' is created with key $1 and value $2
2      a["blue"]    3       # Add $2 on line 2 to a["blue"], so the new value is 3
3      a["blue"]    3       # The key $1 is "red", so a["blue"] does not change
       a["red"]     5       # Entry in 'a' is created with new key "red"
4      a["blue"]    4       # Key "blue", value 1: 3 + 1 = 4
       a["red"]     5       # Key "blue", so a["red"] doesn't change
5      a["blue"]    4       # Key "red", so a["blue"] doesn't change
       a["red"]     7       # Key "red", value 2: 5 + 2 = 7
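The same walkthrough can be reproduced in Python. This is only a minimal sketch: the function name running_totals and the sample list are made up for illustration, and collections.defaultdict(int) stands in for awk's behaviour of treating a missing key as 0.

```python
from collections import defaultdict


def running_totals(lines):
    """Yield (line_number, snapshot of the totals) after each input line."""
    totals = defaultdict(int)  # a missing key starts at 0, like awk's a[key]
    for number, line in enumerate(lines, start=1):
        key, value = line.split()
        totals[key] += int(value)
        yield number, dict(totals)  # copy, so later updates don't alter it


# The five-line blue/red example from above.
sample = ["blue 1", "blue 2", "red 5", "blue 1", "red 2"]

for number, snapshot in running_totals(sample):
    print(number, snapshot)
```

After the fifth line the snapshot holds blue = 4 and red = 7, matching the last two rows of the table.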