The idea is to organize a series of meetings to get familiar with modern scientific libraries and tools. In the beginning, we can go through the materials provided by Software Carpentry.

Once we are familiar with the basic topics (for example, array operations, storage, and plotting), we can look at specific problems from linguistics, biology, physics, or any other field.

Anyone interested in the subject is very welcome to come! Please contact Dmitrijs Milajevs <d.milajevs@qmul.ac.uk> if you have any questions.

## Third meeting: Biggish data processing

host: | Dmitrijs (Dima) Milajevs |
---|---|
difficulty: | easy |
download: | co-occurrence.ipynb |
show: | notebook |

Word similarity is the core notion in distributional semantics, where word meanings are represented as vectors. In such a vector space, the similarity of two words is modeled by the distance between their vectors. There are many datasets for evaluating distributional models, for example, SimLex-999.
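As a rough illustration of comparing word vectors, here is a minimal sketch that uses cosine similarity, one common choice of measure (the notebook may use a different one). The three words, the context dimensions, and the counts are all invented for this example.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy co-occurrence counts, invented for illustration. The dimensions stand
# for the hypothetical context words "drink", "cup" and "road".
vectors = {
    "coffee": np.array([4.0, 3.0, 0.0]),
    "tea": np.array([3.0, 4.0, 0.0]),
    "car": np.array([0.0, 0.0, 5.0]),
}

sim_coffee_tea = cosine_similarity(vectors["coffee"], vectors["tea"])
sim_coffee_car = cosine_similarity(vectors["coffee"], vectors["car"])
print(sim_coffee_tea, sim_coffee_car)
```

Words that share contexts ("coffee" and "tea") end up with a higher score than words that share none ("coffee" and "car").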

During this meeting, we will build our own semantic vector space for the words in SimLex-999 and measure how well the model's similarity scores correlate with human judgments, using generators and Pandas.
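The evaluation step could look roughly like the sketch below. It assumes a generator that yields one scored word pair at a time and a Pandas DataFrame that collects the rows. The helper `score_pairs`, the word pairs, and all the scores are hypothetical and are not taken from SimLex-999. Spearman correlation is used here as an assumption about the measure, computed as the Pearson correlation of the ranks.

```python
import pandas as pd


def score_pairs(pairs, similarity):
    """Lazily yield one row per word pair with its human and model score."""
    for word1, word2, human in pairs:
        yield {
            "word1": word1,
            "word2": word2,
            "human": human,
            "model": similarity(word1, word2),
        }


# Invented pairs and ratings, NOT real SimLex-999 data.
toy_pairs = [
    ("cat", "dog", 1.5),
    ("big", "large", 9.0),
    ("car", "auto", 8.5),
    ("sun", "shoe", 1.0),
]
# Invented model scores standing in for a real vector space model.
toy_model = {
    ("cat", "dog"): 0.30,
    ("big", "large"): 0.80,
    ("car", "auto"): 0.90,
    ("sun", "shoe"): 0.20,
}

# The DataFrame constructor consumes the generator row by row.
df = pd.DataFrame(score_pairs(toy_pairs, lambda a, b: toy_model[a, b]))

# Spearman correlation, computed as Pearson correlation of the ranks.
rho = df["human"].rank().corr(df["model"].rank())
print(df)
print(rho)
```

With real data, `toy_pairs` would be replaced by rows read from the dataset file and the lambda by a function that compares two word vectors.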

## Second meeting: Estimate n-gram probabilities from a text corpus

host: | Dmitrijs (Dima) Milajevs |
---|---|
difficulty: | easy |
download: | n-grams.ipynb |
show: | notebook |

After the first meeting of the NLP seminar there was an idea to replicate Table 6.3 in [statistical-nlp].

During this meeting, we can estimate n-gram probabilities from a corpus using maximum likelihood estimation (MLE). To make things more interesting (and faster), the n-gram counts can be stored as Pandas DataFrames.
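A minimal sketch of the idea for bigrams, assuming a tiny made-up corpus and hypothetical column names (`w1`, `w2`, `n`, `p`). The MLE estimate of P(w2 | w1) is the count of the bigram divided by the total count of bigrams that start with w1.

```python
import pandas as pd

# A tiny stand-in corpus; any tokenized text would do.
tokens = "the cat sat on the mat the cat ran".split()

# One row per bigram occurrence, then count identical rows.
bigrams = pd.DataFrame({"w1": tokens[:-1], "w2": tokens[1:]})
counts = bigrams.groupby(["w1", "w2"]).size().reset_index(name="n")

# MLE: P(w2 | w1) = count(w1, w2) / count(w1, *)
counts["p"] = counts["n"] / counts.groupby("w1")["n"].transform("sum")

p_cat_given_the = counts.set_index(["w1", "w2"]).loc[("the", "cat"), "p"]
print(counts)
print(p_cat_given_the)
```

Here "the" is followed by "cat" twice and by "mat" once, so the estimate for P(cat | the) is 2/3.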

You need to install Pandas to be able to run the code, or you can use Wakari, an IPython Notebook in the cloud.

[statistical-nlp] | Manning, Christopher D., and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999. |

## First meeting: Analyzing Patient Data with Python

host: | Dmitrijs (Dima) Milajevs |
---|---|
difficulty: | easy |

Follow the Software Carpentry lesson Analyzing Patient Data. The tutorial goes through the basic data analysis steps: read the data, process it, and present (plot) the result. In addition to the tutorial, we can have a look at IPython Notebook.
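The read and process steps can be sketched with NumPy as follows. The data is a made-up stand-in for a lesson file, read from an in-memory string so that the example runs without downloading anything. Rows are assumed to be patients and columns days.

```python
import io

import numpy as np

# Made-up stand-in for a data file: rows are patients, columns are days.
csv_text = "0,1,2,3\n0,2,4,6\n0,3,6,9\n"

# Read: parse the comma-separated values into a 2-D array.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# Process: summarize along each axis.
daily_mean = data.mean(axis=0)   # average over patients for each day
patient_max = data.max(axis=1)   # largest value seen for each patient
print(data.shape, daily_mean, patient_max)
```

Presenting the result would then be one more step, for example passing `daily_mean` to a plotting function such as `matplotlib.pyplot.plot`.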

Please install Python and NumPy before the meeting! Getting the packages might be tricky, so contact Dima if you have any problems. You need to have Anaconda installed; ignore the instructions about the text editor and everything else. After a successful install, you should be able to do the following in a terminal (the version numbers are not that important):

```
ipython # or iPython on ipython-2.7
Python 2.7.8 (default, Oct 3 2014, 02:34:26)
Type "copyright", "credits" or "license" for more information.
IPython 2.3.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import numpy
In [2]: quit()
```

## Useful links

- http://software-carpentry.org/lessons.html
- http://scipy-lectures.github.io
- McKinney, Wes. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, 2012.