Text manipulations

Christophe@pallier.org

Sept. 2013

Strings

In Python, text is stored in objects of type ‘str’ (a.k.a. ‘strings’).

String constants are enclosed in single or double quotes; triple quotes allow a string to span several lines:

'bonjour'

"bonjour Paris!"

"""hello
this is a text
spanning several lines
"""

type('123')  # str: the quotes make it a string

type(123)  # int

123 + 456  # integer addition: 579

'123' + '456'  # string concatenation: '123456'

int('123')  # converting str into int
str(1 + 1)  # converting int into str
mystring = 'superman'

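len(mystring)  # number of characters: 8

mystring[0]  # first character (indices start at 0): 's'
mystring[1]  # second character: 'u'
mystring[1:5]  # slice from index 1 up to, but not including, index 5: 'uper'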

for letter in mystring:
    print(letter)
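
As a small sketch extending the examples above (the negative index and the open-ended slices are additions, not part of the original examples), indexing and slicing behave as follows:

```python
mystring = 'superman'

# Indices start at 0; a slice [start:stop] excludes the character at 'stop'.
first_five = mystring[:5]    # same as mystring[0:5]
last_three = mystring[5:]    # from index 5 to the end
last_letter = mystring[-1]   # negative indices count from the end

print(first_five)
print(last_three)
print(last_letter)

# enumerate() gives the position of each letter along with the letter.
for position, letter in enumerate(mystring):
    print(str(position) + ' ' + letter)
```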

A set of functions to manipulate strings is available in the module ‘string’ (see https://docs.python.org/2/library/string.html). Among others, you should know about:
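
A few operations that are often useful, as a minimal sketch. The selection below is an assumption about what is most relevant for the exercises, and the operations are written as string methods (s.find(...) rather than string.find(s, ...)), a form that works in both Python 2 and Python 3:

```python
line = 'Alice was beginning to get very tired'

# find() returns the index of the first match, or -1 if there is none.
pos_found = line.find('Alice')
pos_missing = line.find('Rabbit')

# The 'in' operator gives a simple True/False answer.
has_tired = 'tired' in line

# lower() returns a converted copy; the original string is unchanged.
lowered = line.lower()

# split() cuts a string into a list of words; join() does the reverse.
words = line.split()
rejoined = '-'.join(words[:3])

# replace() substitutes every occurrence of a substring.
replaced = line.replace('tired', 'sleepy')

print(pos_found)
print(pos_missing)
print(has_tired)
print(lowered)
print(words)
print(rejoined)
print(replaced)
```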

Interactive input from the command line:

name = raw_input('What is your name? ')

print('Hello ' + name + '!')

Reading and writing to text files

Create a text file ’essa

Writing:

filename = 'test.txt'
handle = open(filename, 'w')  # 'w' opens the file for writing
handle.write('welcome\n')  # write() does not add a newline by itself
handle.write('to the wonderful\n')
handle.write('world of Python!\n')
handle.close()
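
The file can then be read back. This is a minimal sketch, assuming the file 'test.txt' is created in the current directory; it rewrites the file first (with explicit newline characters) so that the example is self-contained:

```python
filename = 'test.txt'

# Write three lines; write() does not append a newline by itself.
handle = open(filename, 'w')
handle.write('welcome\n')
handle.write('to the wonderful\n')
handle.write('world of Python!\n')
handle.close()

# Read the whole file at once into a single string.
handle = open(filename)
content = handle.read()
handle.close()
print(content)

# Or loop over the file line by line; each line keeps its trailing newline,
# which is stripped here before storing the line.
lines = []
handle = open(filename)
for line in handle:
    lines.append(line.rstrip('\n'))
handle.close()
print(lines)
```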

Exercises

Download Alice in Wonderland.

  1. Write a program that prints the lines that contain the string ‘Alice’ (tip: you can use the find function from the module string). Then test the same program with the strings ‘Rabbit’, ‘rabbit’, ‘stone’ and ‘office’.
  1. Here is a program that converts the text file into a list of words, removing the punctuation marks and converting everything to lower case. Run it.

import string

def remove_punctuation(text):
    # Replace every punctuation mark and newline (chr(10)) with a space
    punct = string.punctuation + chr(10)
    return text.translate(string.maketrans(punct, " " * len(punct)))

textori = file('alice.txt').read().lower()  # whole file as one lower-case string
text = remove_punctuation(textori)
words = text.split()  # split on whitespace
print(words)
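
The program above targets Python 2: string.maketrans and the built-in file() were removed in Python 3. A possible Python 3 equivalent is sketched below, using str.maketrans and open(); to stay self-contained it runs on a short made-up sample (a stand-in, not the real text) rather than on 'alice.txt':

```python
import string


def remove_punctuation(text):
    """Replace punctuation marks and newlines with spaces."""
    punct = string.punctuation + '\n'
    return text.translate(str.maketrans(punct, ' ' * len(punct)))


# Made-up stand-in for: textori = open('alice.txt').read().lower()
sample = 'Alice saw the White Rabbit.\nThe Rabbit said: "Oh dear!"'
textori = sample.lower()
text = remove_punctuation(textori)
words = text.split()
print(words)
```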

Now write a script that counts the number of occurrences of ‘Alice’, ‘Rabbit’ or ‘office’ in the list of words.

  1. Read about Python’s dictionaries (http://docs.python.org/2/tutorial/datastructures.html#dictionaries) and use a dictionary to store the number of occurrences of each word in Alice in Wonderland. The keys are the words and the values are the numbers of occurrences: if words = ['a', 'a', 'b'], then dico = {'a': 2, 'b': 1}.
  1. Use numpy and matplotlib to plot the log of the word frequencies as a function of word rank, with the rank on the x-axis (the most frequent word being ranked #1).

You can skim through http://matplotlib.org/users/pyplot_tutorial.html.

Remark: The product rank × frequency is roughly constant. This ‘law’ was discovered by Estoup and popularized by Zipf. See http://en.wikipedia.org/wiki/Zipf%27s_law.
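
The counting and ranking steps described above can be sketched on the toy list from the dictionary exercise. The variable names (dico, ranked) are illustrative and plotting is left out; with real data, the last printed column (rank times frequency) is the quantity that the remark describes as roughly constant:

```python
words = ['a', 'a', 'b']

# Count the occurrences of each word: keys are words, values are counts.
dico = {}
for word in words:
    dico[word] = dico.get(word, 0) + 1
print(dico)

# Sort the counts in decreasing order: the most frequent word gets rank 1.
ranked = sorted(dico.values(), reverse=True)
for rank, frequency in enumerate(ranked, 1):
    # Columns: rank, frequency, rank * frequency
    print(str(rank) + ' ' + str(frequency) + ' ' + str(rank * frequency))
```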

  1. (advanced) Plot the relationship between word length and word frequency.
  1. Generate a random text of 1 million characters (each letter from a to z being equiprobable, and the space character being 8 times more probable than any single letter). Compute the frequency of each ‘pseudoword’ and plot the rank/frequency diagram.
  1. (advanced) Compute the table of transition frequencies between words in Alice and generate random text following this pattern.
  1. Read about the MU Puzzle (http://en.wikipedia.org/wiki/MU_puzzle). Write a program that generates sequences of strings based on the following production rules and the initial state ‘MI’:

       1. xI -> xIU
       2. Mx -> Mxx
       3. xIIIy -> xUy
       4. xUUy -> xy

(Tip: use the function string.replace)
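
One caveat about the tip, as a hedged sketch: replace rewrites all non-overlapping occurrences at once, starting from the left, whereas a production rule applies to one occurrence at a time, at any position. The helper below (apply_rule3 is a hypothetical name, not part of the exercise) uses slicing to list every string reachable by one application of rule 3:

```python
def apply_rule3(s):
    """Return all strings obtained by replacing one 'III' in s with 'U'."""
    results = set()
    start = s.find('III')
    while start != -1:
        results.add(s[:start] + 'U' + s[start + 3:])
        # Look for the next (possibly overlapping) occurrence.
        start = s.find('III', start + 1)
    return results


# replace() gives a single string for this input ...
print('MIIII'.replace('III', 'U'))
# ... while the rule allows two different results.
print(sorted(apply_rule3('MIIII')))
```

The same slicing pattern can be adapted to rule 4 (xUUy -> xy); rules 1 and 2 only need a test on the end or the start of the string.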