genome wide motif positions take 3

This post describes a procedure for making a function in R that, powered by Bowtie, can retrieve all positions of a motif in the human genome. Bowtie searches for substrings in large texts very quickly by using an index based on the Burrows-Wheeler transform. It therefore relies on a prebuilt Bowtie index of the genome, which we need to fetch in addition to installing the program. Prebuilt indexes are linked from the Bowtie homepage.

In R we make this function (please substitute the paths according to your system).
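The original function is not reproduced here, so the following is only a minimal sketch of what such a wrapper could look like. It assumes Bowtie 1 is on the PATH, that its default tab-separated output is used (reference name in column 3, 0-based offset in column 4), and that the index prefix shown is a placeholder. The names `parse_bowtie_hits` and `bowtie_motif_positions` are hypothetical.

```r
# Turn Bowtie's default output lines into a data.frame of 1-based positions.
# Assumes column 3 = reference name and column 4 = 0-based offset.
parse_bowtie_hits <- function(lines) {
  if (length(lines) == 0) {
    return(data.frame(chr = character(0), pos = integer(0),
                      stringsAsFactors = FALSE))
  }
  fields <- strsplit(lines, "\t", fixed = TRUE)
  data.frame(
    chr = vapply(fields, `[`, character(1), 3),
    pos = as.integer(vapply(fields, `[`, character(1), 4)) + 1L,
    stringsAsFactors = FALSE
  )
}

# Run Bowtie on a single motif given on the command line (-c), reporting all
# exact matches (-a -v 0) on the plus strand only (--norc).
# The index prefix is a placeholder; substitute the path on your system.
bowtie_motif_positions <- function(motif,
                                   index = "/path/to/bowtie/index/hg19",
                                   bowtie = "bowtie") {
  stopifnot(nchar(motif) >= 4)  # shorter motifs are not accepted
  args <- c("-a", "-v", "0", "--norc", "-c", shQuote(index), motif)
  hits <- system2(bowtie, args, stdout = TRUE)
  parse_bowtie_hits(hits)
}

# Usage (requires Bowtie and an index):
# positions <- bowtie_motif_positions("TGAGTTC")
```

Converting the offset to 1-based is a choice made here so that the positions line up with those from string matching in R; drop the `+ 1L` if 0-based coordinates are preferred.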

Note that we retrieve only results on the plus (+) strand, to be consistent with solution 1 and solution 2. Also note that Bowtie will not accept motifs shorter than four characters.



genome wide motif positions take 2

This is the second method in the series of three for obtaining positions of motifs in the human genome. Again, the method should provide the result in R, but this time the actual search machinery will be outside R, controlled by R’s system() command.

The first method relied on finding the motifs in chromosome text strings in R. Now we will use the command line workhorses sed and awk to do this job for us. The way it works is that sed will work on the 1-line chromosome sequence files created in the first post and substitute each occurrence of the motif with a newline character. The result is sent to awk, which keeps track of the number of characters in each line. Awk then outputs a comma-separated list of the positions of the motif. This is read into R through the system() command.
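The pipeline described above can be sketched roughly as follows. This is not the original function: it assumes GNU sed (which accepts `\n` in the replacement), a motif made of plain nucleotide letters, and a 1-line sequence file per chromosome. The function name `motif_positions_sed` is hypothetical. Like any sed substitution, it reports non-overlapping matches only.

```r
# Find 1-based start positions of a motif in a 1-line sequence file by
# splitting the sequence at the motif with sed and counting characters in awk.
motif_positions_sed <- function(motif, chr_file) {
  stopifnot(grepl("^[ACGTN]+$", motif))  # keep the sed pattern literal
  # awk: 'start' is where the motif following the current line begins,
  # 'off' is the number of characters consumed so far (m = motif length).
  awk_prog <- paste(
    'NR > 1 { printf "%s%d", sep, start; sep = "," }',
    '{ start = off + length($0) + 1; off = start + m - 1 }',
    'END { print "" }'
  )
  cmd <- sprintf("sed 's/%s/\\n/g' %s | awk -v m=%d '%s'",
                 motif, shQuote(chr_file), nchar(motif), awk_prog)
  out <- system(cmd, intern = TRUE)
  if (length(out) == 0 || !nzchar(out[1])) return(integer(0))
  as.integer(strsplit(out[1], ",", fixed = TRUE)[[1]])
}

# Usage (the file path is a placeholder):
# chr1_positions <- motif_positions_sed("TGAGTTC", "/path/to/chr1.txt")
```

Each chromosome is an independent shell job, so several of these calls can run side by side without the sequences ever being loaded into R.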

The function also has an option of delivering a data.frame in bed coordinates instead, convenient if we need a bed file with these positions.
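As an illustration of the bed option, here is a small sketch that converts a named list of 1-based start positions into bed-style coordinates (0-based start, end exclusive). The helper name `positions_to_bed` and the column names are assumptions, not taken from the original function.

```r
# Convert a named list of 1-based motif start positions into a data.frame
# with bed coordinates: start is 0-based, end is start + motif length.
positions_to_bed <- function(positions, motif_length) {
  motif_length <- as.integer(motif_length)
  do.call(rbind, lapply(names(positions), function(chr) {
    starts <- as.integer(positions[[chr]]) - 1L
    data.frame(chrom = rep(chr, length(starts)),
               start = starts,
               end = starts + motif_length,
               stringsAsFactors = FALSE)
  }))
}

# Toy example: two hits of a 7-mer on chr1, none on chr2.
bed <- positions_to_bed(list(chr1 = c(4L, 13L), chr2 = integer(0)), 7)
```

The resulting data.frame can be written out with `write.table(bed, sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)` if a bed file is needed.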

Again, to get the positions of our motif, we do:

Since this function does the searching outside of R, the memory issue (in R) is not significant, and parallelization is thus hard coded.

In the next post, I will show a similar solution that employs the power of Bowtie and the Burrows–Wheeler transform for finding genome wide motif positions.

genome wide motif positions take 1

I sometimes need to know the position of motifs in the genome, e.g. if I need to overlap features with a given motif. Other times I need a genomic background characteristic associated with the position of all instances of a given motif.

I thus need a reasonably fast function that can give me such positions, preferably so that I can work with them in R. This post is the first of three, where I implement three rather different methods to perform this task. The first method will function purely in R; the other two methods will rely on magic outside of R.


I will need the human genome (that's the genome I work with). I obtain it in fasta format from UCSC with:

Then we make plain 1-line sequence files for each chromosome:
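The original commands are not shown, so here is one possible way to do this step from within R, as a sketch only. It assumes one fasta file per chromosome, and the function name `fasta_to_one_line` is hypothetical. Since the UCSC files are soft-masked (repeats in lower case), the sketch upper-cases the sequence by default so that a motif such as "TGAGTTC" also matches inside repeats; set `upper = FALSE` to keep the masking.

```r
# Strip the fasta header and join all sequence lines into a single line.
fasta_to_one_line <- function(fasta_file, out_file, upper = TRUE) {
  lines <- readLines(fasta_file)
  seq_lines <- lines[!startsWith(lines, ">")]   # drop header line(s)
  one_line <- paste(seq_lines, collapse = "")
  if (upper) one_line <- toupper(one_line)      # assumption: ignore soft-masking
  writeLines(one_line, out_file)
  invisible(out_file)
}

# Usage (paths are placeholders):
# fasta_to_one_line("/path/to/chr1.fa", "/path/to/chr1.txt")
```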

Next, we load these into R in a list object:
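A sketch of the loading step, assuming the 1-line files sit in one directory and are named after their chromosome (chr1.txt, chr2.txt, ...). Both the naming scheme and the function name `load_genome` are assumptions.

```r
# Read every 1-line sequence file in a directory into a named list,
# one character string per chromosome.
load_genome <- function(seq_dir) {
  files <- list.files(seq_dir, pattern = "\\.txt$", full.names = TRUE)
  genome <- lapply(files, function(f) readLines(f, n = 1, warn = FALSE))
  names(genome) <- sub("\\.txt$", "", basename(files))
  genome
}

# Usage (the directory is a placeholder):
# hg19 <- load_genome("/path/to/one_line_chromosomes")
```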

Now we make a function that returns all positions of a motif in a list:
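A minimal sketch of such a function, using base R's `gregexpr()` with a fixed pattern. The name `find_motif` is hypothetical, and the genome is assumed to be a named list of single strings. Note that `gregexpr()` reports non-overlapping matches only.

```r
# Return a list with the 1-based start positions of the motif
# in every chromosome string of the genome list.
find_motif <- function(motif, genome) {
  lapply(genome, function(chr_seq) {
    hits <- gregexpr(motif, chr_seq, fixed = TRUE)[[1]]
    if (hits[1] == -1) integer(0) else as.integer(hits)
  })
}

# Toy genome standing in for the real hg19 list.
toy <- list(chr1 = "ACGTGAGTTCAATGAGTTC", chr2 = "TTTT")
positions <- find_motif("TGAGTTC", toy)
# positions$chr1 holds 4 and 13; positions$chr2 is empty.
```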

And that’s it. Pretty simple really. Now to get all positions of the motif “TGAGTTC” we just do:

and voila! positions$chr1 contains all positions of the motif on chr1 and so on.

The hg19 object size is around 3.2 GB, and it thus requires its share of memory.

In the next posts, I will show two examples that do not require memory on the R side, and in a third post I will benchmark the methods and discuss pros and cons.

find your previous code snippets in R

It’s been a while – two years actually – since my last post. No worries, here is a new but very short one. I am often in a situation where I, when using R, have forgotten the exact syntax of things. Although Rstudio helps a lot with suggestions of parameters etc., it doesn’t always make my day. Also, I often use command line programs with the system() command, and so no IDE help is present. I thus use a simple grep command to search through practically all my code. This is embedded in an R command so I can do it from Rstudio. The function looks like this:
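The original function is not shown here; a minimal sketch of the idea could look like this. It assumes a recursive `grep` over a single code directory, restricted to `.R` files. The function name `find_code`, the default directory and the file filter are all placeholders.

```r
# Search all .R files below code_dir for a fixed string and print
# the matching lines (file:line:content) to the console.
find_code <- function(pattern, code_dir = "~/code") {
  cmd <- sprintf("grep -rnF --include='*.R' -e %s %s",
                 shQuote(pattern), shQuote(path.expand(code_dir)))
  # grep exits with status 1 when nothing matches; silence that warning.
  hits <- suppressWarnings(system(cmd, intern = TRUE))
  cat(hits, sep = "\n")
  invisible(hits)
}

# Usage:
# find_code("hg19.fa")
```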

I have it loaded at all times, and when I am suddenly in doubt about say the path to my human genome fasta file, I simply search for it using:

and the grep command for hg19.fa is run in a shell in the background with the output appearing in Rstudio’s console area. Then I can quickly see where the path to my file is found. I find this function handy when I try to remember where I might find that little code snippet I used a year ago and can only remember fragments of.



One can easily attach certain tasks to certain keys, such as the function keys. Any (decent) OS would allow such configuration. But was it now F3 that started up Firefox or was it F5? And how about assigning a function key to write out the date in a word processor? Not possible. Perhaps from inside Word you could, but then it wouldn’t work in gedit or notepad. Wouldn’t it be great to have a tool where YOU decide the function key AND the task it performs, and that is applicable system wide? And even better, how about not being restricted to function keys F1-12, but being able to start up Firefox by pressing ‘ff’ or write out the current date ANYWHERE (Excel, Word, notepad, terminal) by pressing ‘d’? Less pressure on your sparse memory cells with meaningful associations!

This desire has spurred the superspace project.

(As explained, Autokey, which does similar things, no longer appears to be maintained.)

The basic idea with superspace is that once the program is started, it listens to key strokes and performs a task depending on those strokes. The point is twofold: to avoid taking your hands off the keyboard, and to be free to assign any keystroke combination to a task without having to worry about accidental activation.

It is a small and simple program that works like this:

1) An activating key stroke sequence, space bar press while super key is pressed, activates a key stroke listener program.

2) The listener records every key until another space bar press.

3) The recorded keys are decoded and translated into a task which is performed.

I use it a lot every day. Here are some examples:

When I type [super+space]gbrowser[space], regardless of where, superspace will open the UCSC genome browser in a Firefox window.
If I type [super+space]d[space] it will write out the current date in the format 20150403. If I instead type [super+space]dd[space] I will get the date and time as 2015_04_03_10:57.

Superspace has spared me from many thousands of keystrokes and mouse gestures.



Autokey to the rescue

If you are anything like me, you hate repetitive tasks. One thing that particularly drives me nuts is repetitive tasks on the computer (performed by me! I am perfectly fine with repetition performed in silico). I have set up my computer cockpit so that it fits me and I like to be there. The physical part includes coffee right to the left of me and a double screen. The virtual part includes various tools. I spend most of my time on the command line and using R, which for my part means working in Rstudio. One tool that I have grown very fond of is Autokey, which is a customizable keystroke configuration program that works everywhere (as in all applications system wide, but the whole thing is Linux specific).

The number of keystrokes Autokey has saved me is beyond counting. With Autokey you can bind long tedious phrases such as: /home/user/subfolder/subsubfolder/mypapers/
to [triggerkey(s)]mp.
I.e. if your trigger keys (like mine) are ++ then pressing ++mp will write out the path above and save you much typing. You can also bind python commands to keystrokes, so in my cockpit, writing ++date today would write out 2015.04.03 and ++d would yield 20150403. ++awk will open my awk cheat sheet located at /home/user/documents/cheatsheets/awk1.txt in gedit. And my many other shortcuts will do all other kinds of things, such as opening my Google calendar in Firefox or ssh me to a work cluster. And all of this works regardless of whether you type it at a command prompt, in a browser's Google search prompt or without a prompt at all. Truly magnificent.

Sad was I when I realized that this extremely useful tool started to stall. I think it began at around the time when Unity was introduced in Ubuntu. Weird things started to happen when Autokey was on in certain editors, mainly Kate and Rstudio. Pressing and holding keys (e.g. an arrow key or the # character) would severely mess up the keyboard input until the application was restarted.
Sadly Autokey does not seem to be maintained anymore.
The purpose of this post is to introduce my workaround, superspace, in the next post and also to attract comments on how people deal with tedious computer tasks. And if anyone used Autokey and faced similar problems, I would like to hear about any workarounds (switching to Windows and using AutoHotkey does not count).