Full Genome Sequence has Arrived! Now what?

J11 · Post by **J11** » Thu Jun 03, 2021 6:40 pm

Tremendously excited to have received my full genome sequence!

This would have cost ~$100 billion; 20 years ago. I don't have $100 billion!
OK, how about $299 with a possible coupon code?
Yeah!

J11 · Post by **J11** » Thu Jun 03, 2021 6:45 pm

split -l 1000000 filename.vcf

J11 · Post by **J11** » Thu Jun 03, 2021 6:47 pm

OK, the format I will use is first provide the code and then create an extended verbose comment.
Above code splits up the massive variant file for a full genome sequence into manageable file size.

The big question after receiving the vcf file is how to even cope with such a large file; it is about 1.5 gig with 4.5 million variants.
I was stumpted with that one for a few days, though now I have figured that one out-- split the file up.

Go to Windows command prompt; change directory until you arrive at where the huge vcf file is then enter:

split -l 1000000 filename.vcf

[Edit: I started up by splitting the file into 50 files. I tried it again with splitting it into only 5 files and this was still within the resources of my computer, so I'll go with 1 million lines in the split above. This will make it a great deal easier when I want to reassemble all of the files.]

This will break up the large vcf file into a manageable number of files (~50) each of 100,000 lines.
These files can then be processed further.

I will post my progress to this thread because there does not seem to be that much help online.
I am sure quite a few people have already received their full genome sequences and have given up in frustration trying to figure out what to do next. Formatting is the first step in unlocking the value of the genome.

J11 · Post by **J11** » Thu Jun 03, 2021 6:50 pm

# Code Chunk 1
# Documented version of the code.
# This just sets things up.
# Install the vcfR package only once; use the library call every time you use the program
# Sometime I will provide everyone a J1.vcf file to follow along
# Will need to download R and Rstudio for the code to work.
# Then open a Rmarkup file and dump the code chunks into the editor

install.packages("vcfR")
library(vcfR)
data(vcfR_example)
vcf

getwd()
vcf <- read.vcfR("xaa.vcf")

J11 · Post by **J11** » Thu Jun 03, 2021 6:51 pm

Above code reads in one of the split files from the full genome sequence.
Here you will need to download the free software R and RStudio.

J11 · Post by **J11** » Thu Jun 03, 2021 6:52 pm

# Code Chunk 2
# All this code does is extract the genotypes for each location as separate character strings
# I am not entirely sure whether this is the best approach for this; I thought why not
# put this into an array? We'll stick with this version for now; if trouble happens plan b.

# First pair of lines extracts the first genotype for each rsnumber and then turns it into an integer string
# Second pair of lines does the same thing for the second genotype.
# Next line then adds the genotype numbers together.
# This was the part that took about 100 lines of code before.
# Converting from character strings to integer strings turns it into a simple addition
# The cbind just puts everything together in one nice matrix

```{r}

vcfgt1<- substr(vcf@gt[,2], 1, 1)
intgt1<- strtoi(vcfgt1,base=10)

vcfgt3<- substr(vcf@gt[,2], 3, 3)
intgt3<- strtoi(vcfgt3,base=10)

intgt<- intgt1 +intgt3

Form2<- cbind(getFIX(vcf)[,3],getREF(vcf),getALT(vcf),intgt)
write.csv(Form2,'Form2out.csv')
```

J11 · Post by **J11** » Thu Jun 03, 2021 7:08 pm

Above code binds together the rsnumber, the genotype 1 and the genotype 2, and the sum of genotype 1 & 2 and outputs this as a comma separated file. Surprisingly, we are already making steady progress after only a few lines of code.

When we open the above file Form2out.csv in a spreadsheet, we see 100,000 rsnumbers along with three columns of 0/1,1 and 1/2.
What this is telling us is how many alternative alleles each rsnum has. If it had 2 reference alleles (i.e., 0/0 genotype) then it would not be reported in this file. Fortunately the file only provides variants that are not reference genotypes. Including all the genotypes in a full genome sequence would increase the number of variants reported from ~4 million to the full 3 billion which would completely crash a typical computer. All you need is what is different not what is the same.

What you can now do is copy and paste the rsnumbers and the final column into another spreadsheet and now you have the rsnumbers associated with the number of alternative alleles for 100,000 variants. (Tweaking the software by removing the first 2 variables would prevent even having to doing the copying and pasting). This is what you want to know. If you have a polygenic score and you want to calculate your score you need to multiply the beta by the number of variants and then sum up. The one tricky thing that will need to be worked around is that sometimes the reference isn't always the reference. Will need to go through and provide the reference from the full genome sequence and then compare it to the reference used in the polygenic scores).

Something remarkable to notice (and the motivator to all of this formatting) is how small the file now has become. The original full genome sequence vcf was ~ 1Gb which was completely unmanageable. Then the file was split into ~50 smaller files and the file sizes were reduced ~25 Mg. Now with the removal of all the extraneous vcf info we are down to mere 50 1 meg files.

It is possible that an entire full genome could be loaded into R Studio and the polygenic scores calculated with a single command.
That would be remarkably powerful.

Hmm, there are ~300 NA rsnumbers in the first split file. Basically, out of the 100,000 variants in the first file 300 are so weird that they do not even have an rsnum for them yet. Always encouraging to know that one has genotypes never encountered before till now in genetic research. Often those are exactly the most interesting variants to investigate.

OK, I just edited the above code so that the ref and alternate genotypes are right there in the file. I removed the intermediate count.
So only need to delete the first blank line in the spreadsheet and then remove the first column which is only counting the variants and the file is ready to go.

This might wind up being much easier than I had thought it would be.
Having the genome fully formatted with a manageable use of computer resources could allow very powerful analysis of polygenic scores. For example, there have been quite a number of PGS published for AD, IQ, depression etc.. If this all works to plan, it could be a simple matter of uploading the GWAS results and then running the software in R to calculate the PGS. That could be highly informative about a range of traits.

J11 · Post by **J11** » Thu Jun 03, 2021 8:31 pm

Ok, one snag is that when you move to the second file, the vcf file header is not present because the file has been split. Copy and paste
all the file header Statting with ##fileformat=VCFv and then carry down right to the end of the header and place this into the split files, save them as vcf and click through when it talks about losing format. The R code will then be able to handle the split files. This appears to be the only snag and then the files will be formatted. All ~5 million variants will be put into only 5 smallish files that could then be recombined.

One other trick is that when you are changing around directories it is important to reset the working directory in R so it knows where to find the files. Aside from that it appears that it is now smooth sailing.

J11 · Post by **J11** » Thu Jun 03, 2021 9:00 pm

All right, 5 million variants have now been processed into 5 comma separated files of a total size of only around 100Mb. These files have the rsnumbers, the ref and alternative alleles and the alternative counts. This is now fairly close to what I could use to almost effortlessly calculate polygenic scores. This would be extremely useful in translating the massive GWAS research efforts into a practical result. Perhaps we could even post urls for PGS research once the software code is fully up and running.

OK, I have one file with 5 million variants with rsnumbers, ref and alternative genotype, and the alternative count for each variant.
This is a smallish file of now only around ~85 MB.

What I would like to do now is to move this file into R Studio and then sort the rsnumbers.
Then I could take a sorted list of PGS rsnumbers and make one pass through the 5 million variants and find all my variants that matched the PGS variants. I would then have my polygenic for the hits. The last step would be to figure out how the reference genotypes in my genome relate to the genotypes for the PGS variants.

The easiest way around the question about finding the reference alleles for a PGS dataset might be to simply copy and paste the
rsnumbers and find an online app that could provide the ref allele. All variants that were not reported in the full genome sequence must be homozygous reference. I would have 2 of whatever the beta might be for that genotype.

J11 · Post by **J11** » Thu Jun 03, 2021 10:26 pm

Very exciting!

This url accepts batch enquiries for SNPs.
Enter in a list of rsnumbers and it will report back information including the reference allele.
Throw this into a spread sheet and then it looks like the reference allele can be grabbed as a substring.
With the reference alleles one could then compare it to the named allele that is given in the GWAS and determine the
effect of these genotypes in your genome.

https://www.ncbi.nlm.nih.gov/sites/batchentrez

Full Genome Sequence has Arrived! Now what?

Full Genome Sequence has Arrived! Now what?

Re: Full Genome Sequence has Arrived! Now what?

Re: Full Genome Sequence has Arrived! Now what?

Re: Full Genome Sequence has Arrived! Now what?

Re: Full Genome Sequence has Arrived! Now what?

Re: Full Genome Sequence has Arrived! Now what?

Re: Full Genome Sequence has Arrived! Now what?

Re: Full Genome Sequence has Arrived! Now what?

Re: Full Genome Sequence has Arrived! Now what?

Re: Full Genome Sequence has Arrived! Now what?