Full Genome Sequencing Polygenic Scores Software

Insights and discussion from the cutting edge with reference to journal articles and other research papers.
Post Reply
J11
Contributor
Contributor
Posts: 3351
Joined: Sat May 17, 2014 4:04 pm

Full Genome Sequencing Polygenic Scores Software

Post by J11 »

I am close to receiving my full genome sequence and I want to have the PGS software in an accessible location (i.e., here).
Software will need some polishing as I am not sure what R packages are needed etc. though this will at least keep the software
somewhere where it wouldn't be lost on my hard drive.
J11
Contributor
Contributor
Posts: 3351
Joined: Sat May 17, 2014 4:04 pm

Re: Full Genome Sequencing Polygenic Scores Software

Post by J11 »

---
title: "Polygenic Scores"
file name: Polygenic Scores Great
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## This R code calculates the polygenic score for a set of genotypes

# Below code is now close to functional.

1. Format data rs# Genotype Good
2. Call the data for a given set of rs#. Good
3. Determine the number of reference alleles for each rs#. Good
4. From 3 I now have the dataframe X. Good
With the dataframe Y (which is aligned to Y) I can use code chunk 4 to calculate the PGS.


I will need to load the genotypes from the VCf file as some dataset-- Geno
It would be best to have Geno as rs number genotype


1. Format data rs# Genotype

I want rs# and then number of Reference , number of alt1, number of alt 2 ...
The below code is close to achieveing this.


The code towards the bottom appears to be able to do this.
I now have "rs#" "0/1" for the J1 vcf file.
This is quite close to what I want.
I could convert the vcf file to # Reference and # Alternative.
I would then have; the format that I need.



1a
```{r}

vcfgt<- substr(vcf@gt[,2], 1, 3)
Form1<- cbind(getFIX(vcf)[,3],vcfgt)
head(Form1)
summary(Form1)


vcfgt1<- substr(vcf@gt[,2], 1, 1)
vcfgt3<- substr(vcf@gt[,2], 3, 3)

Form2<- cbind(getFIX(vcf)[,3],vcfgt1,vcfgt3)
head(Form2)
Form2[1:20,]

summary(Form1)


Form2[1,]



```
1b.

I want to count the number of "0", "1', "2" ...

how many "0"s are in the column 2 in the first row?

Form2[1,]




16.3

```{r}
count<-0

if ( Form2[,2] ==0) { count<- 0
count<-count+1}
count
```

Will need to find a way to create a loop.
loops are not favored in R, find a way around the loop.

The below code is able to count the number of references and alternatives.


```{r}
Form2[1:10,2]
```


```{r}
count<- rep(c(0), times=nrow(Form2), each= 1)
count[1:10]



for (i in 1:nrow(Form2))
if ( Form2[i,2] ==0) {count<- 1}
ref1<- count
Form3<-cbind(Form2,ref1)
Form3[1:10,]

count2<-count

for (i in 1:nrow(Form2))
if ( Form2[i,3] ==0) {count2<- count+1}
Form4<-cbind(Form3,count2)
Form4[1:30,]

```

```{r}
count3<- rep(c(0), times=nrow(Form2), each= 1)


for (i in 1:nrow(Form2))
if ( Form2[i,2] ==1) {count3<- 1}
Form5<-cbind(Form4,count3)
Form5[1:10,]

count5<-count3

for (i in 1:nrow(Form2))
if ( Form2[i,3] ==1) {count5<- count5+1}
Form7<-cbind(Form5,count5)
Form7[1:30,]
```


FormGood<- Form7 [,-2]

```{r}

Form7[1:30,]
FormGood<- Form7 [,-2]
FormGood
```




```{r}
count<- rep(c(0), times=nrow(Form2), each= 1)
count[1:10]



for (i in 1:nrow(Form2))
if ( Form2[i,2] ==0) {count<- 1}
Form3<-cbind(Form2,count)
Form3[1:10,]

count2<-count

for (i in 1:nrow(Form2))
if ( Form2[i,3] ==0) {count2<- count+1}
Form4<-cbind(Form3,count2)
Form4[1:30,]
```


```{r}
count3<- rep(c(0), times=nrow(Form2), each= 1)


for (i in 1:nrow(Form2))
if ( Form2[i,2] ==1) {count3<- 1}
Form5<-cbind(Form4,count3)
Form5[1:10,]

count5<-count3

for (i in 1:nrow(Form2))
if ( Form2[i,3] ==1) {count5[i]<- count5[i]+1}
Form7<-cbind(Form5,count5)
Form7[1:30,]
```


2. Call the data for a given set of rs#.


I now have rs# Ref# Alt1#, Alt2# ...
I would have a list of rs# rs1 rs 2 rs 3 ...
I want to call the dataframe entry.


Each time I found a new polygenic score of interest I would request that
the genotypes be extracted.


So I will have a list of genotypes rs#s.
I want to match this list with my VCFmod file


I would then have a dataframe of rsnumbers and genotypes.


Variant 1

```{r}

rsnum<- "rs201219564"
#PGSgeno<- rep(c(0), times=7, each= 1)
PGSgeno<- matrix(data =1:70,nrow =10,ncol =7)



#PGSgeno[1]
Form7[2,]

for (i in 1:nrow(Form7))
if (!(is.na(Form7[i,1])))
if ( Form7[i,1] ==rsnum) {PGSgeno[1,]<- Form7[i,1:7]}
PGSgeno

Form7[1:10,]
Form7[5,1:7]



is.it.true <- function(x) {

if(x == TRUE) {print("x was true!")}
if(x == FALSE) {print("x was false!")}

}

is.it.true(Form2[1,1] ==rsnum)
Form2[1,1]


!(is.it.true ( Form2[i,1] ==NA))
Form2[i,1]


!(is.na( Form2[2,1]))
Form2[2,1]
matrix( rep( 0, len=25), nrow = 5)

```


This time I want to set up a matrix for the rsnumber and betas.
I could then just load in a spread sheet with information for 1 polygenic score-- genotypes, and betas.



```{r}
rsnum<- "rs201219564"

# Read in rsnumbers and Betas (Possibly also the reference genetype A1)


# Length is 2 times the number of SNPs.
# Iniitalize the rsnumber and beta variable
rsnumbeta<- matrix(rep( 0, len=200), nrow = 100)
rsnumbeta

rsnumbeta[1,1]<- "rs201219564"
rsnumbeta[2,1]<- "rs142727405"
rsnumbeta[3,1]<- "rs2272757"

# Above is unnecessary.
# Read in rsnumbers and Betas (Possibly also the reference genetype A1)
# From a spread sheet
# Variable is called rsnumbeta





#PGSgeno<- rep(c(0), times=7, each= 1)
PGSgeno<- matrix(rep(0, len=700), nrow = 100)
PGSgeno


#PGSgeno[1]
Form7[2,]

for (i in 1:nrow(Form7))

for (i in 1:100)
for(j in 1:3)
if (!(is.na(Form7[i,1])))
if ( Form7[i,1] ==rsnumbeta[j,1]) {PGSgeno[j,]<- Form7[i,1:7]}
PGSgeno


```


3. Determine the number of reference alleles for each rs#.

Question what is 0 and 1 in this format? 0 is reference? 1 is alternative?
Answer: 0 is reference, 1 is Alernative

Note: Might need to check what is reference in the vcf versus what is reference in the
PGS article

#This was done in step 2


```{r}

```



4. From 3 I now have the dataframe X.
With the dataframe Y (which is aligned to Y) I can use code chunk 4 to calculate the PGS.

# Code Chunk 4
# X is the number of genotypes present corresponding to the Beta.
# Y is the Beta for each genotype aligned with the columns of X



# I have read in rsnumbeta
# The second column of this variable is Y.
# so Y<- rsnumbeta[,2]

# X is Form7[,7]
# X<- Form7[,7]



```{r }


X1<-X
for (i in 1:ncol(X)) X1[,i]<-X[,i]*Y[i]
X1<-rowSums(X1)


X
Y
X1


```


```{r}


install.packages("vcfR")
library(vcfR)
data(vcfR_example)
vcf
```


```{r}
strwrap(vcf@meta[1:7])
```



```{r}
queryMETA(vcf)
```

```{r}
queryMETA(vcf, element = 'DP')
```

```{r}
queryMETA(vcf, element = 'FORMAT=<ID=DP')
```
```{r}
head(getFIX(vcf))

getFIX(vcf)[1:10,3]

```


```{r}
vcf@gt[1:6, 1:4]
```

```{r}
vcf@gt[1:6, 2]
```


```{r}
vcf@gt[1, 2]
```

```{r}

getwd()
vcf <- read.vcfR("J1.vcf")
```

```{r}
head(vcf)
```

```{r}

Form1<- cbind(getFIX(vcf)[,3],vcf@gt[,2])
head(Form1)
summary(Form1)

```




```{r}
vcf@gt[1, 2]
```


```{r}
substr(vcf@gt[1, 2], 1, 3)
```


```{r}
vcfgt<- substr(vcf@gt[,2], 1, 3)
Form1<- cbind(getFIX(vcf)[,3],vcfgt)
head(Form1)
summary(Form1)

```

```{r}
Form1[1:20,]
```


```{r}
Form1[1:100,]
```

```{r}
substr(vcf@gt[1, 2], 3, 3)
```

```{r}
vcf@gt[1:20, 2]
```

```{r}
substr(vcf@gt[1:20, 2], 3, 3)
```

```{r}
substr(vcf@gt[1:20, 2], 1, 1)
```

```{r}
vcf[1:20,]
```
```{r}
head(getFIX(vcf))
```

```{r}
getFIX(vcf[10,])
```

```{r}
getFIX(vcf[1:10,])

```

```{r}
getFIX(vcf[1:10,3])
```
J11
Contributor
Contributor
Posts: 3351
Joined: Sat May 17, 2014 4:04 pm

Re: Full Genome Sequencing Polygenic Scores Software

Post by J11 »

Yeah! I was worried that this code would be unusable because I couldn't figure out what I did.

Looks like if I start with the install.packages code around line 430, then it runs nicely.
Probably was able to make different parts of it working at different times and then just did not
go back to line 1.

Once the vcf file is read in on line 430 then perhaps everything will have be functional.
If this code can be made workable, then all the polygenic scores for all the many traits (e.g., Alzheimer's) should be computable by the program.

Went through the code. There are surprisingly few errors once the vcf file is read in.
Code might be salvageable.


Errors
line 93 if ( Form2[,2] ==0) { count<- 0
count<-count+1}

line 274 is.it.true(Form2[1,1] ==rsnum)

line 278 !(is.it.true ( Form2[i,1] ==NA))
(looks like an omitted loop for i)

line 332 if ( Form7[i,1] ==rsnumbeta[j,1]) {PGSgeno[j,]<- Form7[i,1:7]}
line 333 PGSgeno

line 385-92 X and Y are not provided

line 573 getFIX(vcf[1:10,3])

(Not that many errors; mostly probably related to non-specified variables.)
User avatar
floramaria
Support Team
Support Team
Posts: 1423
Joined: Tue Jul 04, 2017 11:22 am
Location: Northern New Mexico

Re: Full Genome Sequencing Polygenic Scores Software

Post by floramaria »

J11 wrote:because I couldn't figure out what I did.
I really have no idea what you did but think this an impressive undertaking!
Functional Medicine Certified Health Coach
IFM/ Bredesen Training in Reversing Cognitive Decline (March 2017)
ReCODE 2.0 Health Coach with Apollo Health
J11
Contributor
Contributor
Posts: 3351
Joined: Sat May 17, 2014 4:04 pm

Re: Full Genome Sequencing Polygenic Scores Software

Post by J11 »

Thank you flora.

The code is not that impressive, though I wanted to lift it off my hard drive before it became unfindable and completely incomprehensible. The goal here is to have a nice and simple program that would take a list of rsnumbers with their associated betas and output a polygenic score. Clearly, those on forum would be interested to see their score for AD, though there are many many others now out there. Posters might post such files to this thread and then everyone could access the information. Coding this, does not appear to be that challenging. The rough notes that I posted above is pretty close to what is needed.

It is a reminder though for all of us that the genetic age that we have talked about for so long really has arrived. Polygenic scores unlock the human genome and will give us deep insights into ourselves and others. While we might be just as happy to not know, the fact is that we do know and there is probably no use in pretending otherwise.
J11
Contributor
Contributor
Posts: 3351
Joined: Sat May 17, 2014 4:04 pm

Re: Full Genome Sequencing Polygenic Scores Software

Post by J11 »

# Code Chunk 1
# Documented version of the code.
# This just sets things up.
# Install the vcfR package only once; use the library call every time you use the program
# Sometime I will provide everyone a J1.vcf file to follow along
# Will need to download R and Rstudio for the code to work.
# Then open a Rmarkup file and dump the code chunks into the editor

```{r}
install.packages("vcfR")
library(vcfR)
data(vcfR_example)
vcf

getwd()
vcf <- read.vcfR("J1.vcf")
```
J11
Contributor
Contributor
Posts: 3351
Joined: Sat May 17, 2014 4:04 pm

Re: Full Genome Sequencing Polygenic Scores Software

Post by J11 »

# Code Chunk 2
# All this code does is extract the genotypes for each location as separate character strings
# I am not entirely sure whether this is the best approach for this; I thought why not
# put this into an array? We'll stick with this version for now; if trouble happens plan b.

# First pair of lines extracts the first genotype for each rsnumber and then turns it into an integer string
# Second pair of lines does the same thing for the second genotype.
# Next line then adds the genotype numbers together.
# This was the part that took about 100 lines of code before.
# Converting from character strings to integer strings turns it into a simple addition
# The cbind just puts everything together in one nice matrix


```{r}
vcfgt1<- substr(vcf@gt[,2], 1, 1)
intgt1<- strtoi(vcfgt1,base=10)

vcfgt3<- substr(vcf@gt[,2], 3, 3)
intgt3<- strtoi(vcfgt3,base=10)

intgt<- intgt1 +intgt3

Form2<- cbind(getFIX(vcf)[,3],vcfgt1,vcfgt3,intgt)
head(Form2)

```
Last edited by J11 on Mon Apr 05, 2021 6:45 pm, edited 1 time in total.
J11
Contributor
Contributor
Posts: 3351
Joined: Sat May 17, 2014 4:04 pm

Re: Full Genome Sequencing Polygenic Scores Software

Post by J11 »

Owwh, I was worried that I was going to have to bluff it and fill up a page of code simply because I didn't know what I was doing.
It looks like I can simply avoid all that mess with a few lines of code. I added them above.

This is the current output (below).
As we see when we add across rows with the first two numbers we obtain the last number. Great!
So, 0 +1 =1 Correct
... 1+1 =2 Correct

This is telling us the number of alleles that we have.
This is moving towards what we want.
For a polygenic score we want to multiply the number of effect alleles with the beta.
Might be as easy as pairwise multiplication and we are done.

Not so easy.
The effect allele is not necessarily the alternative allele.
Will need to find a way to compare the effect allele with the reference/alternative.

vcfgt1 vcfgt3 intgt
[1,] NA "0" "1" "1"
[2,] "rs71252251" "0" "1" "1"
[3,] NA "0" "1" "1"
[4,] NA "0" "1" "1"
[5,] "rs201219564" "1" "1" "2"
[6,] "rs75062661" "1" "1" "2"
J11
Contributor
Contributor
Posts: 3351
Joined: Sat May 17, 2014 4:04 pm

Re: Full Genome Sequencing Polygenic Scores Software

Post by J11 »

Code Chunk rough
# All that is needed to calculate the polygenic score is a scalar multiplication.
# Need to write a function that takes a set of rsnumbers from the polygenic set and then finds them in the vcf file



```{r}
beta<- c(1,2,1,4,5)
intgt[1:5]*beta
```
J11
Contributor
Contributor
Posts: 3351
Joined: Sat May 17, 2014 4:04 pm

Re: Full Genome Sequencing Polygenic Scores Software

Post by J11 »

Code Chunk rough
# Another stylized chunk that might help out.
# Idea is to take in the rsnums from the polygenic score and find there genotypes
# in the vcf file. Search through the vcf file and find genotypes.
# Put all of this into a new array poly.
# Poly would have everything: It would have the Form2 which is the genotypes separated out, the added number of alternatives,
it would have the beta and all the info for each rsnumber.
# Could then simply do a scalar multiplication of the allele number in Form2 and the rsnumberbeta.

# This is getting close.
# One step in the middle is to compare the ref/alt alleles in the rsnumber input file with the ref/alt alleles in the vcf to make
sure everything corresponds properly.



for (k in 1:rnum)
for (iij in 1:50000)
if (getID(vcf[iij])==rsnum[k])
{poly[k]<-cbind(Form2,rsnumbeta[k],(getFIX(vcf[iij,], getINFO = TRUE))}
Post Reply