Starting out in bioinformatics

So you want to do your thesis on a data-driven project that relies heavily on computers doing the work for you. That means you need to be proficient with computers. Which skills do you need to be efficient?

This is a list of the skills that I have found most useful and that have enabled me to get as far as I have. These are personal recommendations, in the order in which I would teach them to new students. The list is also a work in progress: I will update it as more essential things come to mind.

1. Use a search engine

I would not have thought this required mentioning in 2017, but apparently using search engines such as DuckDuckGo or Google efficiently is not common knowledge. In an age where the internet has all the information you need to learn something, it is essential that you know how to find it. Your favorite search engine will be your most frequently visited website and the most important portal to all online resources.

Learn how to formulate broad and specific search queries and how to modify them to zero in on the answer you are looking for. Everything else can be picked up and learned from the internet, but you have to know what to look for and how.

2. Know the command line

The Linux/Unix command line (or shell) is, by far, the tool that will save you the most time if you know it well. Learn how to manage thousands of files with a single command, chain tools into powerful pipelines, and use loops to automate tedious workflows. Get used to at least one shell, such as Bash (preinstalled on most modern Linux systems and macOS) or Zsh, and the features it offers to save you even more time: shortcut keys, the history mechanism, variables, and command substitution. Using a command line shell also lets you document and reproduce workflows easily by writing simple scripts that wrap the recurring steps in your analyses.
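To make a few of these features concrete, here is a minimal sketch that combines a variable, command substitution, and a loop. The file names (sample_A.txt and so on) are made up for illustration, and everything happens in a temporary directory so nothing else is touched.

```shell
# Work in a throwaway directory so the example cannot affect real data.
workdir=$(mktemp -d)          # command substitution: capture a command's output
cd "$workdir" || exit 1

# Create a few empty placeholder files (hypothetical sample names).
touch sample_A.txt sample_B.txt sample_C.txt

# One loop renames every matching file, however many there are.
for f in sample_*.txt; do
    mv "$f" "${f%.txt}.tsv"   # ${f%.txt} strips the .txt suffix from the name
done

# Count the renamed files with a small pipeline and keep the number in a variable.
n_renamed=$(ls sample_*.tsv | wc -l | tr -d ' ')
echo "renamed $n_renamed files"
```

The same loop works unchanged for three files or three thousand, which is where the time savings come from. Saved to a file, these lines also serve as a record of what was done.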

Learn the basics of navigating and working with the file system, then move on to the standard Unix tools for stream manipulation. Many tasks can be solved by combining these well-proven and efficient building blocks. Try to learn new aspects as you go and you will find yourself writing flexible pipelines in no time.
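As an illustration of combining such building blocks, the following sketch counts how often each value occurs in one column of a table. The input is a made-up, tab-separated list of genes and organisms written to a temporary file; only standard tools (cut, sort, uniq, head, awk) are used.

```shell
# Hypothetical input: one record per line, tab-separated (gene, organism).
table=$(mktemp)
printf 'geneA\thuman\ngeneB\tmouse\ngeneC\thuman\ngeneD\tzebrafish\ngeneE\thuman\n' > "$table"

# cut picks column 2, sort groups identical values, uniq -c counts each group,
# and sort -k1,1nr orders the counts from largest to smallest.
counts=$(cut -f 2 "$table" | sort | uniq -c | sort -k1,1nr)
echo "$counts"

# The most frequent organism: first line of the result, second field.
top=$(echo "$counts" | head -n 1 | awk '{print $2}')
echo "most frequent: $top"

rm -f "$table"
```

Each tool does one small job, and the pipe symbol passes the output of one as input to the next. Swapping a single stage, for example cutting a different column, adapts the pipeline to a new question.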

3. Learn a programming language

I know this sounds daunting, but I highly recommend putting in the effort. If you already know the shell, the learning curve will be shallower than if you were starting from zero, and a programming language is an essential tool in your repertoire. You will need it for the problems that are too complex to solve efficiently with the shell.

Start with a scripting language that is easy to learn and use and that has good community resources: many established modules that solve common problems, and a broad user base that has already answered recurring questions online. Popular examples are Python and Perl. AWK (also preinstalled on most Linux systems) is helpful too, especially within shell pipelines, but you can pick it up easily as you need it if you already know one of the others.
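To give an idea of how AWK slots into a shell pipeline, here is a small sketch that computes the length of each sequence in a FASTA file and ranks the records by length. The two-record file is made up for the example, and the variable names (id, len) are arbitrary choices, not a convention.

```shell
# Hypothetical two-record FASTA file, written to a temporary location.
fasta=$(mktemp)
printf '>seq1\nACGT\nACG\n>seq2\nGGCC\n' > "$fasta"

# AWK adds up the sequence lines of each record; sort then ranks by length.
lengths=$(awk '
    /^>/ { if (id != "") print id "\t" len   # new header: emit previous record
           id = substr($0, 2); len = 0; next }
         { len += length($0) }               # sequence line: add its length
    END  { if (id != "") print id "\t" len } # emit the final record
' "$fasta" | sort -k2,2nr)
echo "$lengths"

rm -f "$fasta"
```

A task like this is awkward with cut and sort alone because one record spans several lines, which is roughly the point where a small AWK program, or a script in Python or Perl, starts to pay off.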

4. Learn R

R almost made it into the list of scripting languages above until I remembered that it is special. I think R is required because you need it to do statistics and to make good-looking plots, something that every scientist needs. Many other things can be done in R, too, but its strengths lie more in mathematical modeling and publication-quality graphics than in working with line-oriented text, which is why you often need a different language or tool to format the input for R.

Learning R will be frustrating when you move on from the most basic things to more complex tasks, but boy, are the results worth the effort.

Biodatacore

Evolution. Data. Science.