genome wide motif positions take 3

This post describes a procedure for making a function in R that, powered by Bowtie, can retrieve all positions of a motif in the human genome. Bowtie uses the ultra fast Burrows Wheeler algorithm to search for substrings in large texts. This relies on a Bowtie index file which we need to fetch apart from the installation of the program. These are linked from the Bowtie homepage.

In R we make this function (please substitute the paths according to your system).

Note that we retrieve only results on the plus (+) strand, to be consistent with solution 1 and solution 2. Also note that Bowtie will not allow motifs with less than four characters.

Usage:

getMot3(“GCATTTGAC”).

genome wide motif positions take 2

This is the second method in the series of three, for obtaining positions of motifs in the human genome. Again, the method should provide the result in R, but this time the actual search machinery will be outside R, controlled by R’s system() command.

The first method relied on finding the motifs in chromosome text strings in R. Now we will use the command line workhorses sed and awk to do this job for us. The way it works is that sed will work on the 1-line chromosome fasta files created here and substitute the motif with new line characters. This is send to awk which will keep track of the number of characters in each line. Awk then outputs a comma separated list of the positions of the motif. This is read into R through the system() command.

The function also has an option of delivering a data.frame in bed coordinates instead, convenient if we need a bed file with these positions.

Again, to get the positions of our motif, we do:


Since this function does the searching outside of R, the memory issue (in R) is not significant, and parallelization is thus hard coded.

In the next post, I will show a similar solution that employ the power of Bowtie and the Burrows–Wheeler transform for finding genome wide motif positions.

genome wide motif positions take 1

I sometimes need to know the position of motifs in the genome, e.g. if I need to overlap features with a given motif. Other times I need a genomic background characteristic associated with the position of all instances of a given motif.

I thus need a reasonably fast function that can give me such positions, preferably so that I can work with them in R. This post is the first of three, where I implement three rahter different methods to perform this task. The first method will function purely in R, the other two methods will rely on magic outside of R.

Prerequisites

I will need the human genome (thats the genome I work with). I obtain it in a fasta format from UCSC with:

Then we make plain 1-line sequence fiiles for each chromosome:

Next, we load these into R in a list object:

Now we make a function that returns all positions of a motif in a list:

And that’s it. Pretty simple really. Now to get all positions of the motif “TGAGTTC” we just do:

and voila! positions$chr1 contains all positions of the motif on chr1 and so on.

The hg19 object size is around 3.2 Gb, and thus require it’s share of memory.

In the next posts, I will show two examples that does not require memory on the R side, and in a third post I will benchmark the methods and discuss pros and cons.

find your previous code snippets in R

It’s been a while – two years actually – since my last post. No worries, here is a new but very short one. I am often in a situation where I, when using R, have forgotten the exact syntax of things. Although Rstudio helps a lot with suggestions of parameters etc., it doesn’t always make my day. Also, I often use command line programs with the system() command, and so no IDE help is present. I thus use a simple grep command to search through practically all my code. This is embedded in an R command so I can do it from Rstudio. The function looks like this:

I have it loaded at all times, and when I am suddenly in doubt about say the path to my human genome fasta file, I simply search for it using:

and the grep command for hg19.fa is run in a shell in the background with the output appearing in Rstudio’s console area. Then I can quickly see when the path to my file is found. I find this function handy when I try to remember where I might find that little code snippet I used a year ago and can only remember fragments of.

 

kybocosh

So, as explained, my beloved Autokey no longer appear maintained, and it started to corrupt my system. But over the years I grew fond of it, to the point that I cant, no I wont, live without (some of) its functionality. So I decided to hack my way to a substitution.

Here I will describe in detail how I do this. Lets go ambitious and call it a piece of software and hence give it a name. If not for other reasons, than because it eases reference in the text.
Thy name shall be kybocosh (KeYBOard COmmand Shortcuts)

The way my hack (now promoted to the software kybocosh) works is through three modules:
1) A key stroke listen and recording module (KSL)
2) A read and execute module (RE)
3) A command module (CM)
The basic idea is that once the kybocosh app (note how it changes status) is started, the KSL module starts listening to every keyboard press and release and records to a file. At the same time, the RE-module reads this file and looks for evidence of the combination of keystrokes that define a trigger sequence (++ in this post, here [ctrl][ctrl]) and a shortcut ending sequence ([ctrl]). If succesfully found, the RE greps the characters in between and looks them up in the CM-module which holds a pairing of shortcuts and codes to be executed.
As a consequence, when I type [ctrl][ctrl]gbrowser[ctrl] regardless of where, kybocosh will open the UCSC genome browser in a Firefox window. If I type
[ctrl][ctrl]d[ctrl] e.g. in a file naming dialog box, it will return me the current date and time, e.g. 2015_04_03_10:57.

Below is the code for the KSL and RE module. The first part is the KSL module.

Then comes the RE module part.


The code:

Ok, thats the code. I know this is a little messy, and if I get time, I will write this up in a more friendly way. I will try to update this soon also with the command module, and a more thourough explanation of how things work.

 

 

 

Autokey to the rescue

If you are anything like me or most other people, you hate repetitive tasks. One thing that particularly drives me nuts is repetitive tasks on the computer. I have set up my computer cockpit so that it fits me and I like to be there. The physical part includes coffee right to the left of me and a double screen. The virtual part includes various tools. I spend most of my time on the command line and using R, which for my part means working in Rstudio. One tools that I have grown very fond of is Autokey, which is a  costumizable keystroke(s) configuration program that works everywhere (Linux specific).

The amount of keystrokes Autokey has saved me is countless. With Autokey you can bind long tedious phrases such as: /home/user/subfolder/subsubfolder/mypapers/
to [triggerkey(s)]mp.
I.e.  if you triggerkeys (like mine) are ++ then pressing ++mp will write out the path above and save you much typing. Also you can bind python commands to keystrokes, so in my cockpit, writing ++date today would write out 2015.04.03 and ++d would yield 20150403. ++awk will open my awk cheat sheet located at /home/user/documents/cheatsheets/awk1.txt in gedit. And my 100+ other shortcuts will do all other kinds of things such as opening my Google calender in Firefox or ssh me to a work cluster. And all of this works regardless of whether you type it while at a command prompt or in a browsers google search prompt or without a prompt at all. Truly magnificent.

Sad was I when I realized that this extremely useful tool started to stall. I think it began at around the time when Unity (course it. You get used to it though.) was introduced in Ubuntu (my Linux flavour). Weird things started to happen when Autokey was on in certain editors, mainly Kate and Rstudio. Pressing and holding  keys (e.g. arrow or # character) would severely mess up the keyboard input until the application was restarted.
Sadly Autokey does not seem to be maintained anymore.
The purpose of this post is to introduce my workaround in the next post(s) and also to attract comments on how people deal with tedious computer tasks. And if anyone used Autokey and faced similar problems, I would like to hear about any workarounds (switching to Windows and use AutoHotKey does not  count).