Compare two raw data files for discrepancies
Want to know how well your raw data from two different testing companies agrees? If both files use the same format as a 23andMe raw data file (rsID in the first column, genotype in the fourth column), you can run the commands below in a terminal window on a Mac or in Linux.
Navigate to the folder where your data files are located.
First, extract the rsID and genotype columns from each file, skipping the comment lines, and sort the results:
awk '!/^#/ {print $1"\t"$4}' file1.txt | sort > file1_sorted.txt
awk '!/^#/ {print $1"\t"$4}' file2.txt | sort > file2_sorted.txt
Then compare the files to see how many SNPs are in common and how many have discrepant genotypes.
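# Count SNPs present in both files
OVERLAP=$(join -t $'\t' file1_sorted.txt file2_sorted.txt | wc -l)
# Save the SNPs whose genotypes differ between the two files
join -t $'\t' file1_sorted.txt file2_sorted.txt | awk -F'\t' '$2 != $3' > discrepancies.txt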
# Count discrepancies
DISCREPANCIES=$(wc -l < discrepancies.txt)
# Calculate matches
MATCHES=$((OVERLAP - DISCREPANCIES))
echo "Total overlapping SNPs: $OVERLAP"
echo "Matching genotypes: $MATCHES"
echo "Discrepant genotypes: $DISCREPANCIES"
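To check the steps before running them on real data, here is a minimal sketch that runs the same pipeline on two tiny made-up files. The rsIDs, positions, and genotypes are invented for illustration, and the intermediate file name overlap.txt is an assumption, not part of the original commands.

```shell
#!/usr/bin/env bash
# Toy end-to-end run; all rsIDs and genotypes below are invented sample data.
set -euo pipefail
workdir=$(mktemp -d)
cd "$workdir"

# Two tiny files in the 23andMe layout: rsid, chromosome, position, genotype.
printf '# comment line\nrs1\t1\t100\tAA\nrs2\t1\t200\tAG\nrs3\t2\t300\tCC\n' > file1.txt
printf '# comment line\nrs2\t1\t200\tAG\nrs3\t2\t300\tCT\nrs4\t3\t400\tGG\n' > file2.txt

# Keep rsID and genotype, drop comment lines, and sort each file.
for f in file1 file2; do
  awk '!/^#/ {print $1"\t"$4}' "$f.txt" | sort > "${f}_sorted.txt"
done

# Join on rsID, then keep only the rows where the two genotypes differ.
join -t $'\t' file1_sorted.txt file2_sorted.txt > overlap.txt
awk -F'\t' '$2 != $3' overlap.txt > discrepancies.txt

# tr strips the padding that wc -l adds on macOS.
OVERLAP=$(wc -l < overlap.txt | tr -d ' ')
DISCREPANCIES=$(wc -l < discrepancies.txt | tr -d ' ')
echo "overlap=$OVERLAP discrepancies=$DISCREPANCIES matches=$((OVERLAP - DISCREPANCIES))"
# prints: overlap=2 discrepancies=1 matches=1
```

Only rs2 and rs3 appear in both toy files, and only rs3 has different genotypes (CC versus CT), so discrepancies.txt ends up with a single line. Note that this comparison is a plain string match, so a genotype written as AG in one file and GA in the other would be counted as a discrepancy.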