Channel: n-grams in python, four, five, six grams? - Stack Overflow

Answer by Yann Dubois for n-grams in python, four, five, six grams?


If efficiency is an issue and you have to build several different n-grams (up to a hundred, as you say) but want to stay in pure Python, I would do:

    from itertools import chain

    def n_grams(seq, n=1):
        """Returns an iterator over the n-grams of the token list seq."""
        # shiftToken(i) yields the tokens of seq starting at index i
        shiftToken = lambda i: (el for j,el in enumerate(seq) if j>=i)
        shiftedTokens = (shiftToken(i) for i in range(n))
        # zipping the n shifted streams yields one n-tuple per position
        tupleNGrams = zip(*shiftedTokens)
        return tupleNGrams # if join in generator : ("".join(i) for i in tupleNGrams)

    def range_ngrams(listTokens, ngramRange=(1,2)):
        """Returns an iterator over all n-grams for n in range(*ngramRange) given listTokens."""
        return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))

Usage (the upper bound of ngramRange is exclusive, as with range):

    >>> input_list = 'test the ngrams generator'.split()
    >>> list(range_ngrams(input_list, ngramRange=(1,4)))
    [('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]
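The comment in n_grams hints at joining each tuple into a string. A minimal sketch of that variant, under some assumptions: it swaps the enumerate filter for itertools.islice (a different way to get the same shifted streams), joins with a space, and counts the results. The helper joined_ngram_counts and its parameters ngram_range and sep are hypothetical names used only for illustration.

```python
from collections import Counter
from itertools import chain, islice


def n_grams(seq, n=1):
    """Return an iterator over the n-grams of a list of tokens."""
    # islice(seq, i, None) skips the first i tokens, playing the role of
    # shiftToken(i) in the answer's code (assumes seq is a list or tuple).
    shifted = (islice(seq, i, None) for i in range(n))
    return zip(*shifted)


def joined_ngram_counts(tokens, ngram_range=(1, 3), sep=" "):
    """Count joined n-grams for every n in range(*ngram_range) (hypothetical helper)."""
    all_ngrams = chain.from_iterable(
        n_grams(tokens, n) for n in range(*ngram_range)
    )
    return Counter(sep.join(gram) for gram in all_ngrams)


tokens = "the cat saw the cat".split()
counts = joined_ngram_counts(tokens, ngram_range=(1, 3))
print(counts["the cat"])  # 2
print(counts["saw the"])  # 1
```

Joining inside a generator expression keeps everything lazy until the Counter consumes it, so no intermediate list of tuples is built.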

Roughly the same speed as NLTK (measured with IPython's %%timeit):

    import nltk

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=5)
    # 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    n_grams(input_list,n=5)
    # 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=1)
    nltk.ngrams(input_list,n=2)
    nltk.ngrams(input_list,n=3)
    nltk.ngrams(input_list,n=4)
    nltk.ngrams(input_list,n=5)
    # 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    range_ngrams(input_list, ngramRange=(1,6))
    # 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
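Because n_grams returns a lazy zip object, calling it without iterating does very little work, so it can be useful to also time the case where the iterator is exhausted. Below is a rough sketch using the standard timeit module. It covers only the pure-Python function, assumes a split token list rather than a raw string, and uses a hypothetical consume helper. No timings are claimed here; they depend on the machine.

```python
import timeit
from collections import deque


def n_grams(seq, n=1):
    """Same logic as the n_grams function in this answer."""
    def shifted(i):
        # Tokens of seq starting at index i.
        return (el for j, el in enumerate(seq) if j >= i)
    return zip(*(shifted(i) for i in range(n)))


def consume(iterator):
    """Exhaust an iterator without storing its items (hypothetical helper)."""
    deque(iterator, maxlen=0)


# Assumption: a real token list, smaller than the input used in the answer.
tokens = ("test the ngrams iterator vs nltk " * 10**4).split()

create_only = timeit.timeit(lambda: n_grams(tokens, n=5), number=10)
create_and_consume = timeit.timeit(
    lambda: consume(n_grams(tokens, n=5)), number=10
)

print(f"create only:        {create_only:.6f} s")
print(f"create and consume: {create_and_consume:.6f} s")
```

Comparing the two numbers shows how much of the cost sits in building the iterator and how much in walking through it.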

Repost from my previous answer.

