Compare two raw data files for discrepancies

Want to know how well your raw data from two different testing companies agrees? If both files are in the same format as a 23andMe raw data file (rsID in the first column, genotype in the fourth column), you can run the commands below in a terminal window on a Mac or in Linux.


Navigate to the folder where your data files are located.


First, pull the rsID and genotype columns out of each file and sort them by rsID using these awk commands:


awk '!/^#/ {print $1"\t"$4}' file1.txt | sort > file1_sorted.txt
awk '!/^#/ {print $1"\t"$4}' file2.txt | sort > file2_sorted.txt
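To see what this step produces, here is a minimal sketch run on a tiny made-up file. The file name sample.txt, the temporary folder, and the three SNP rows are all invented for illustration and are not real data.

```shell
# Hypothetical sample in the 23andMe layout: rsid, chromosome, position, genotype
tmpdir=$(mktemp -d)
printf '# rsid\tchromosome\tposition\tgenotype\nrs3\t1\t300\tCT\nrs1\t1\t100\tAA\nrs2\t1\t200\tAG\n' > "$tmpdir/sample.txt"

# Skip comment lines (starting with #), keep rsID and genotype, sort by rsID
awk '!/^#/ {print $1"\t"$4}' "$tmpdir/sample.txt" | sort > "$tmpdir/sample_sorted.txt"

# Show the two-column result
cat "$tmpdir/sample_sorted.txt"
```

The sorted file keeps only two tab-separated columns per SNP, which is the shape the comparison step expects.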


Then compare the files to see how many SNPs are in common and how many have discrepant genotypes.


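# Count SNPs present in both files
OVERLAP=$(join -t $'\t' file1_sorted.txt file2_sorted.txt | wc -l)

# Save the SNPs whose genotypes differ between the two files
join -t $'\t' file1_sorted.txt file2_sorted.txt | awk -F'\t' '$2 != $3' > discrepancies.txt

# Count discrepancies
DISCREPANCIES=$(wc -l < discrepancies.txt)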

# Calculate matches

MATCHES=$((OVERLAP - DISCREPANCIES))

echo "Total overlapping SNPs: $OVERLAP"

echo "Matching genotypes: $MATCHES"

echo "Discrepant genotypes: $DISCREPANCIES"
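As a quick sanity check of the arithmetic, the comparison can be sketched on two tiny sorted files. Everything here is an assumption for illustration: the file names a_sorted.txt and b_sorted.txt, the temporary folder, and the made-up genotypes. The step that filters rows where the two genotype columns differ is one plausible way to build the list of discrepancies.

```shell
# Made-up, already-sorted sample data (rsID <TAB> genotype); not real results
tmpdir=$(mktemp -d)
printf 'rs1\tAA\nrs2\tAG\nrs3\tCT\nrs5\tGG\n' > "$tmpdir/a_sorted.txt"
printf 'rs1\tAA\nrs2\tGG\nrs3\tCT\nrs4\tTT\n' > "$tmpdir/b_sorted.txt"

# Join on rsID: only SNPs present in both files are kept
join -t "$(printf '\t')" "$tmpdir/a_sorted.txt" "$tmpdir/b_sorted.txt" > "$tmpdir/joined.txt"

# Keep rows whose two genotype columns differ
awk -F'\t' '$2 != $3' "$tmpdir/joined.txt" > "$tmpdir/diff.txt"

# tr strips the padding that wc adds on macOS
OVERLAP=$(wc -l < "$tmpdir/joined.txt" | tr -d ' ')
DISCREPANCIES=$(wc -l < "$tmpdir/diff.txt" | tr -d ' ')
MATCHES=$((OVERLAP - DISCREPANCIES))

echo "overlap=$OVERLAP matches=$MATCHES discrepancies=$DISCREPANCIES"
cat "$tmpdir/diff.txt"
```

In this sample, rs4 and rs5 each appear in only one file, so they drop out of the join and do not count as discrepancies. Only rs2, which is present in both files with different genotypes, is reported.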