If efficiency is a concern and you need to build n-grams for many different values of n (up to a hundred, as you say) while staying in pure Python, I would do:
    from itertools import chain

    def n_grams(seq, n=1):
        """Returns an iterator over the n-grams given a listTokens"""
        # shiftToken(i) lazily yields seq starting at offset i
        shiftToken = lambda i: (el for j, el in enumerate(seq) if j >= i)
        shiftedTokens = (shiftToken(i) for i in range(n))
        # zip stops at the shortest shifted stream, giving tuples of n consecutive tokens
        tupleNGrams = zip(*shiftedTokens)
        return tupleNGrams  # to join inside the generator: ("".join(i) for i in tupleNGrams)

    def range_ngrams(listTokens, ngramRange=(1,2)):
        """Returns an iterator over all n-grams for n in range(*ngramRange) given a listTokens."""
        # the upper bound of ngramRange is excluded, as with range()
        return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))
Usage (the upper bound of ngramRange is excluded, so (1,4) yields n = 1, 2, 3):

    >>> input_list = 'test the ngrams generator'.split()
    >>> list(range_ngrams(input_list, ngramRange=(1,4)))
    [('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]
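As a minimal sketch of the join option mentioned in the code comment, the tuples can be turned into space-separated strings and counted. The functions are repeated here so the snippet runs on its own; the shifting uses `itertools.islice` instead of the `enumerate` filter (a swapped-in equivalent for lists), and `joined_ngram_counts` is a hypothetical helper name, not part of the code above.

```python
from collections import Counter
from itertools import chain, islice


def n_grams(seq, n=1):
    """Return an iterator over the n-grams (as tuples) of a list of tokens."""
    # islice(seq, i, None) is a lazy view of seq starting at offset i
    return zip(*(islice(seq, i, None) for i in range(n)))


def range_ngrams(tokens, ngram_range=(1, 2)):
    """Chain the n-grams for every n in range(*ngram_range); upper bound excluded."""
    return chain.from_iterable(n_grams(tokens, n) for n in range(*ngram_range))


def joined_ngram_counts(tokens, ngram_range=(1, 2)):
    """Hypothetical helper: count n-grams as space-joined strings."""
    return Counter(" ".join(gram) for gram in range_ngrams(tokens, ngram_range))


tokens = "the cat saw the cat".split()
counts = joined_ngram_counts(tokens, ngram_range=(1, 3))  # unigrams and bigrams
print(counts.most_common(3))
```

Joining only at the end keeps the n-gram generation lazy, so nothing is materialised until the `Counter` consumes the stream.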
Roughly the same speed as NLTK:
    import nltk

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=5)
    # 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    n_grams(input_list,n=5)
    # 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=1)
    nltk.ngrams(input_list,n=2)
    nltk.ngrams(input_list,n=3)
    nltk.ngrams(input_list,n=4)
    nltk.ngrams(input_list,n=5)
    # 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    range_ngrams(input_list, ngramRange=(1,6))
    # 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
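Both `nltk.ngrams` and `n_grams` return lazy iterators, so a timing only reflects the n-gram work itself once the iterator is consumed. Here is a minimal sketch of such a measurement with the standard `timeit` module. It repeats an `islice`-based version of the pure-Python functions so it runs on its own, leaves NLTK out to avoid the dependency, and uses a hypothetical `consume` helper. No timings are quoted, since they depend on the machine.

```python
import timeit
from collections import deque
from itertools import chain, islice


def n_grams(seq, n=1):
    """Return an iterator over the n-grams (as tuples) of a list of tokens."""
    return zip(*(islice(seq, i, None) for i in range(n)))


def range_ngrams(tokens, ngram_range=(1, 2)):
    """Chain the n-grams for every n in range(*ngram_range); upper bound excluded."""
    return chain.from_iterable(n_grams(tokens, n) for n in range(*ngram_range))


def consume(iterator):
    """Hypothetical helper: exhaust an iterator without storing its items."""
    deque(iterator, maxlen=0)


# Split into tokens first; iterating a raw string would yield character n-grams.
tokens = ("test the ngrams iterator vs nltk " * 10**4).split()

# Creating the iterator does almost no work; consuming it does all of it.
lazy = timeit.timeit(lambda: range_ngrams(tokens, (1, 6)), number=10)
full = timeit.timeit(lambda: consume(range_ngrams(tokens, (1, 6))), number=10)
print(f"creating the iterator: {lazy:.6f} s for 10 calls")
print(f"consuming it fully:    {full:.6f} s for 10 calls")
```

The same `consume` wrapper could be put around `nltk.ngrams(tokens, n)` to compare the two on equal terms.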
Repost from my previous answer.