API

All functionality of prestools can be accessed after importing the desired module into Python, for example with import prestools.bioinf as pb. Some functions are also available as Command-Line Interface commands, but only a subset is exposed this way, so the CLI should not be relied on for full coverage.

Python module functions

prestools.bioinf

prestools.bioinf.aa_one_to_three(sequence: str) → str[source]

Convert one-letter amino acid code to three-letter code.

Parameters

sequence – sequence of amino acids in one-letter code

Returns

sequence converted to three-letter code

Return type

new_seq

prestools.bioinf.aa_three_to_one(sequence: str) → str[source]

Convert three-letter amino acid code to one-letter code.

Parameters

sequence – sequence of amino acids in three-letter code

Returns

sequence converted to one-letter code

Return type

new_seq

prestools.bioinf.hamming_distance(seq_1: str, seq_2: str, ignore_case: bool = False) → int[source]

Calculate the Hamming distance between two sequences.

Parameters
  • seq_1 – first sequence to compare

  • seq_2 – second sequence to compare

  • ignore_case – ignore case when comparing sequences (default: False)

Returns

Hamming distance

Return type

distance
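
To illustrate the behaviour described above, here is a minimal sketch of a Hamming distance with an ignore_case switch. The name hamming_distance_sketch and the choice to raise ValueError on unequal lengths are assumptions made for this example; it is not the prestools implementation.

```python
def hamming_distance_sketch(seq_1: str, seq_2: str, ignore_case: bool = False) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_1) != len(seq_2):
        # Assumption: unequal lengths are rejected; prestools may behave differently.
        raise ValueError("sequences must have the same length")
    if ignore_case:
        seq_1, seq_2 = seq_1.upper(), seq_2.upper()
    return sum(a != b for a, b in zip(seq_1, seq_2))


print(hamming_distance_sketch("ACGT", "ACGA"))                    # 1
print(hamming_distance_sketch("acgt", "ACGT"))                    # 4
print(hamming_distance_sketch("acgt", "ACGT", ignore_case=True))  # 0
```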

prestools.bioinf.jukes_cantor_distance(seq_1: str, seq_2: str) → float[source]

Calculate the Jukes-Cantor distance between two sequences.

Return the Jukes-Cantor distance between seq_1 and seq_2, calculated as distance = -b log(1 - p/b) where b = 3/4 and p = p_distance.

Parameters
  • seq_1 – first sequence to compare

  • seq_2 – second sequence to compare

Returns

Jukes-Cantor distance

Return type

distance
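
The formula above can be worked through in a few lines. The sketch below assumes two aligned, gap-free sequences of equal length and takes p as the proportion of differing sites; the helper names p_distance_sketch and jukes_cantor_sketch are hypothetical and do not reproduce the prestools source.

```python
import math


def p_distance_sketch(seq_1: str, seq_2: str) -> float:
    """Proportion of aligned positions at which the two sequences differ."""
    if len(seq_1) != len(seq_2) or not seq_1:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a != b for a, b in zip(seq_1, seq_2)) / len(seq_1)


def jukes_cantor_sketch(seq_1: str, seq_2: str) -> float:
    """Apply distance = -b * log(1 - p / b) with b = 3/4."""
    b = 0.75
    p = p_distance_sketch(seq_1, seq_2)
    if p >= b:
        # The logarithm is undefined once p reaches b (saturated sequences).
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -b * math.log(1 - p / b)


# One difference over ten sites: p = 0.1, corrected distance slightly larger.
print(round(jukes_cantor_sketch("ACGTACGTAC", "ACGTACGTAA"), 6))  # 0.107326
```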

prestools.bioinf.kimura_distance(seq_1: str, seq_2: str) → float[source]

Calculate the Kimura 2-Parameter distance between two sequences.

Return the Kimura 2-Parameter distance between seq_1 and seq_2, calculated as distance = -0.5 log((1 - 2p - q) * sqrt(1 - 2q)) where p = transition frequency and q = transversion frequency.

Parameters
  • seq_1 – first sequence to compare

  • seq_2 – second sequence to compare

Returns

Kimura distance

Return type

distance
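
As a worked example of the formula above, the following sketch counts transitions (A/G or C/T exchanges) and transversions (all other mismatches) and plugs their frequencies into the expression. It assumes aligned sequences containing only A, C, G and T; the name kimura_sketch is hypothetical and the code is not taken from prestools.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_sketch(seq_1: str, seq_2: str) -> float:
    """Kimura 2-Parameter distance for two aligned nucleotide sequences."""
    if len(seq_1) != len(seq_2) or not seq_1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq_1.upper(), seq_2.upper()):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    p = transitions / len(seq_1)
    q = transversions / len(seq_1)
    # math.log raises ValueError if the sequences are too divergent
    # for the argument to stay positive.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition (A->G) and one transversion (A->C) over ten sites.
print(round(kimura_sketch("AAAAAAAAAA", "GAAAAAAAAC"), 6))  # 0.234123
```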

prestools.bioinf.mutate_sequence(sequence: str, mutations: int = 1, alphabet: str = 'nt') → str[source]

Mutate a sequence introducing a given number of mutations.

Introduce a specific number of mutations into the given sequence.

Parameters
  • sequence – input sequence to mutate

  • mutations – number of mutations to introduce (default: 1)

  • alphabet – character alphabet to use (‘nt’, ‘aa’) (default: ‘nt’)

Returns

mutated sequence

Return type

sequence

prestools.bioinf.nt_frequency(sequence: str) → Dict[str, float][source]

Calculate nucleotide frequencies.

Return a dictionary with nucleotide frequencies from the given sequence.

Parameters

sequence – input nucleotide sequence

Returns

dictionary of nucleotide frequencies

Return type

freqs
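
A possible reading of this function is sketched below. It assumes that "frequency" means the relative frequency (count divided by sequence length) of each of A, C, G and T; whether prestools reports other characters, or uses a different normalisation, is not stated here. The name nt_frequency_sketch is hypothetical.

```python
from collections import Counter
from typing import Dict


def nt_frequency_sketch(sequence: str) -> Dict[str, float]:
    """Relative frequency of A, C, G and T in a nucleotide sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence.upper())
    # Assumption: characters other than A, C, G, T are ignored in the output
    # but still count towards the sequence length.
    return {nt: counts[nt] / len(sequence) for nt in "ACGT"}


print(nt_frequency_sketch("AACG"))
# {'A': 0.5, 'C': 0.25, 'G': 0.25, 'T': 0.0}
```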

prestools.bioinf.p_distance(seq_1: str, seq_2: str) → float[source]

Calculate the pairwise distance between two sequences.

Return the uncorrected distance between seq_1 and seq_2.

Parameters
  • seq_1 – first sequence to compare

  • seq_2 – second sequence to compare

Returns

pairwise distance

Return type

distance

prestools.bioinf.quantile_norm(x: numpy.ndarray, to_log: bool = False) → numpy.ndarray[source]

Normalize the columns of x so that they all have the same distribution.

Given an expression matrix (microarray data, read counts, etc) of M genes by N samples, quantile normalization ensures all samples have the same spread of data (by construction).

The data across each row are averaged to obtain an average column. Each column quantile is replaced with the corresponding quantile of the average column.

Parameters
  • x – array of input data, of shape (N_genes, N_samples)

  • to_log – log-transform the data before normalising (default: False)

Returns

array of normalised data, of shape (N_genes, N_samples)

Return type

xn
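
The procedure described above (sort each column, average across columns at each rank, then write the averages back by rank) can be sketched without NumPy as follows. This is an illustration on nested lists, not the prestools code: the name quantile_norm_sketch is hypothetical, the to_log option is omitted, and ties are broken by row order, which is an assumption.

```python
from typing import List


def quantile_norm_sketch(x: List[List[float]]) -> List[List[float]]:
    """Quantile-normalise a matrix given as rows (genes) by columns (samples)."""
    n_rows, n_cols = len(x), len(x[0])
    columns = [[row[j] for row in x] for j in range(n_cols)]
    # Sort every column, then average across columns at each rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / n_cols for r in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # order[r] is the row index holding the r-th smallest value of column j.
        order = sorted(range(n_rows), key=col.__getitem__)
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


data = [[5, 4, 3], [2, 1, 4], [3, 3, 5], [4, 2, 9]]
for row in quantile_norm_sketch(data):
    print(row)
# [6.0, 6.0, 2.0]
# [2.0, 2.0, 3.0]
# [3.0, 4.0, 4.0]
# [4.0, 3.0, 6.0]
```

After normalisation every column contains the same set of values (2, 3, 4 and 6 here), only in a different order.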

prestools.bioinf.random_sequence(length: Union[int, str], alphabet: str = 'nt') → str[source]

Create a random sequence of the given length.

Create a random sequence of the given length using the specified alphabet (nucleotides or amino acids).

Parameters
  • length – desired length of the random sequence

  • alphabet – character alphabet to use (‘nt’, ‘aa’) (default: ‘nt’)

Returns

new random sequence

Return type

sequence

prestools.bioinf.reverse_complement(sequence: str, conversion: str = 'reverse_complement') → str[source]

Convert a nucleotide sequence into its reverse complement.

Convert a nucleotide sequence into its reverse, complement or reverse complement.

Parameters
  • sequence – nucleotide sequence to be converted

  • conversion – type of conversion to perform (‘r’|‘reverse’, ‘c’|‘complement’, ‘rc’|‘reverse_complement’) (default: ‘rc’|‘reverse_complement’)

Returns

converted sequence
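
The three conversion modes can be illustrated with a short sketch. It assumes a plain DNA alphabet (A, C, G, T, upper or lower case) and rejects unknown conversion names with ValueError; both choices, and the name reverse_complement_sketch, are assumptions for this example rather than documented prestools behaviour.

```python
# Translation table mapping each base to its complement, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(sequence: str, conversion: str = "reverse_complement") -> str:
    """Return the reverse, complement or reverse complement of a DNA sequence."""
    if conversion in ("r", "reverse"):
        return sequence[::-1]
    if conversion in ("c", "complement"):
        return sequence.translate(COMPLEMENT)
    if conversion in ("rc", "reverse_complement"):
        return sequence.translate(COMPLEMENT)[::-1]
    raise ValueError(f"unknown conversion: {conversion!r}")


print(reverse_complement_sketch("ATGC"))        # GCAT
print(reverse_complement_sketch("ATGC", "r"))   # CGTA
print(reverse_complement_sketch("ATGC", "c"))   # TACG
```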

prestools.bioinf.rpkm(counts: numpy.ndarray, lengths: numpy.ndarray) → numpy.ndarray[source]

Calculate reads per kilobase transcript per million reads.

RPKM = (10^9 * C) / (N * L)

Where:

  • C = Number of reads mapped to a gene

  • N = Total mapped reads in the experiment

  • L = Exon length in base pairs for a gene

Parameters
  • counts – count data where columns are individual samples and rows are genes, of shape (N_genes, N_samples)

  • lengths – gene lengths in base pairs in the same order as the rows in counts, of shape (N_genes, )

Returns

RPKM normalized counts matrix, of shape (N_genes, N_samples)

Return type

normed
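
A small worked example may make the RPKM formula more concrete. The sketch below uses nested lists instead of NumPy arrays and takes N, for each sample, as the column sum of the counts matrix; that reading of "total mapped reads" and the name rpkm_sketch are assumptions for illustration.

```python
from typing import List


def rpkm_sketch(counts: List[List[float]], lengths: List[float]) -> List[List[float]]:
    """Apply RPKM = (10^9 * C) / (N * L) to a genes-by-samples counts matrix."""
    n_samples = len(counts[0])
    # N: total reads per sample, taken here as the sum of each column.
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [1e9 * c / (totals[j] * length) for j, c in enumerate(row)]
        for row, length in zip(counts, lengths)
    ]


counts = [[100, 10], [300, 30], [600, 60]]   # 3 genes, 2 samples
lengths = [1000, 2000, 3000]                 # gene lengths in base pairs
for row in rpkm_sketch(counts, lengths):
    print(row)
# [100000.0, 100000.0]
# [150000.0, 150000.0]
# [200000.0, 200000.0]
```

The two samples differ tenfold in sequencing depth but give the same RPKM values, which is the point of dividing by N.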

prestools.bioinf.shuffle_sequence(sequence: str) → str[source]

Shuffle the given sequence.

Randomly shuffle a sequence, maintaining the same composition.

Parameters

sequence – input sequence to shuffle

Returns

shuffled sequence

Return type

tmp_seq

prestools.bioinf.tajima_nei_distance(seq_1: str, seq_2: str) → float[source]

Calculate the Tajima-Nei distance between two sequences.

Return the Tajima-Nei distance between seq_1 and seq_2, calculated as distance = -b log(1 - p / b), where:

  • b = 0.5 * [1 - Sum i from A to T (Gi^2) + p^2 / h]

  • h = Sum i from A to G (Sum j from C to T (Xij^2 / (2 * Gi * Gj)))

  • p = p-distance

  • Xij = frequency of pair (i,j) in seq_1 and seq_2, with gaps removed

  • Gi = frequency of base i over seq_1 and seq_2

Parameters
  • seq_1 – first sequence to compare

  • seq_2 – second sequence to compare

Returns

Tajima-Nei distance

Return type

distance

prestools.bioinf.tamura_distance(seq_1: str, seq_2: str) → float[source]

Calculate the Tamura distance between two sequences.

Return the Tamura distance between seq_1 and seq_2, calculated as distance = -C log(1 - P/C - Q) - 0.5(1 - C)log(1 - 2Q), where:

  • P = transition frequency

  • Q = transversion frequency

  • C = GC1 + GC2 - 2 * GC1 * GC2

  • GC1 = GC-content of seq_1

  • GC2 = GC-content of seq_2

Parameters
  • seq_1 – first sequence to compare

  • seq_2 – second sequence to compare

Returns

Tamura distance

Return type

distance
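
The Tamura formula can be checked numerically with the sketch below. It assumes aligned sequences made only of A, C, G and T, and the names gc_content_sketch and tamura_sketch are hypothetical. In this example both sequences have a GC-content of 0.5, so C = 0.5 and the result coincides with the Kimura 2-Parameter value for the same P and Q.

```python
import math

# Unordered base pairs that count as transitions.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def gc_content_sketch(sequence: str) -> float:
    """Fraction of G and C bases in a sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def tamura_sketch(seq_1: str, seq_2: str) -> float:
    """distance = -C log(1 - P/C - Q) - 0.5 (1 - C) log(1 - 2Q)."""
    if len(seq_1) != len(seq_2) or not seq_1:
        raise ValueError("sequences must be non-empty and of equal length")
    seq_1, seq_2 = seq_1.upper(), seq_2.upper()
    mismatches = [(a, b) for a, b in zip(seq_1, seq_2) if a != b]
    n_transitions = sum(frozenset(pair) in TRANSITIONS for pair in mismatches)
    p = n_transitions / len(seq_1)
    q = (len(mismatches) - n_transitions) / len(seq_1)
    gc_1, gc_2 = gc_content_sketch(seq_1), gc_content_sketch(seq_2)
    # c is zero when both sequences are all-GC or all-AT; not handled here.
    c = gc_1 + gc_2 - 2 * gc_1 * gc_2
    return -c * math.log(1 - p / c - q) - 0.5 * (1 - c) * math.log(1 - 2 * q)


# One transition (A->G) and one transversion (C->A) over ten sites.
print(round(tamura_sketch("ACGTACGTAC", "GCGTACGTAA"), 6))  # 0.234123
```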

prestools.clustering

prestools.clustering.find_n_clusters_elbow(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], plot: bool = False, method: str = 'ward') → Union[int, None, ValueError][source]

Find the suggested number of clusters using the elbow method.

Find the suggested number of clusters for the given dataframe of correlations, using the elbow method.

Parameters
  • df – input dataframe of correlations

  • plot – plot the resulting elbow plot (default: False)

  • method – method to use to cluster the data (‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’) (default: ‘ward’)

Returns

number of clusters found

Return type

n_clusters

prestools.clustering.hierarchical_clustering(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], method: str = 'ward') → Union[prestools.classes.HierCluster, None, ValueError][source]

Hierarchical clustering of a dataframe.

Return clustering created using scipy from a given dataframe of correlations, using the HierCluster class available in prestools.classes.

Parameters
  • df – input dataframe of correlations

  • method – method to use to cluster the data (‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’) (default: ‘ward’)

Returns

instance of prestools.classes.HierCluster()

Return type

cl

prestools.graph

prestools.graph.flatten_image(img: numpy.ndarray, scale: bool = False) → numpy.ndarray[source]

Convert an image array to a single-dimension vector.

Parameters
  • img – input image array of shape (l, h, d = 3)

  • scale – scale resulting vector dividing its values by 255 (default: False)

Returns

reshaped vector of shape (l * h * d, 1)

Return type

v

prestools.graph.plot_confusion_matrix(cm: numpy.ndarray, class_names: List[str], title: str = 'Confusion Matrix', cmap: str = 'Reds', normalize: bool = False, save: Union[bool, str] = False)[source]

Create a plot from a confusion matrix array.

Parameters
  • cm – input confusion matrix array

  • class_names – class names to use

  • title – title for resulting plot (default: ‘Confusion Matrix’)

  • cmap – colormap to use (default: ‘Reds’)

  • normalize – use classes ratios instead of raw numbers (default: False)

  • save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)

prestools.graph.plot_dendrogram(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], cut_off: Union[bool, float] = False, title: str = 'Dendrogram', save: Union[bool, str] = False, method: str = 'ward')[source]

Plot a dendrogram from a dataframe.

Create (and optionally save) a dendrogram plot starting from a given dataframe of correlations. It is also possible to add a cut-off line given a distance to use for separating clusters.

Parameters
  • df – input dataframe of correlations

  • cut_off – if not False, a vertical line will be added to better identify clusters (default: False)

  • title – title for resulting plot (default: ‘Dendrogram’)

  • save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)

  • method – method to use to cluster the data (default: ‘ward’)

prestools.graph.plot_heatmap_dendrogram(df: pandas.core.frame.DataFrame, cmap: str = 'RdBu_r', title: str = 'Cluster Heatmap', save: Union[bool, str] = False, method: str = 'ward')[source]

Plot a heatmap with hierarchical clustering of a dataframe.

Create (and optionally save) a heatmap with hierarchical clustering created using Seaborn, starting from a given dataframe of correlations.

Parameters
  • df – input dataframe of correlations

  • cmap – colormap to use (default: ‘RdBu_r’)

  • title – title for resulting plot (default: ‘Cluster Heatmap’)

  • save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)

  • method – method to use to cluster the data (default: ‘ward’)

prestools.graph.reduce_xaxis_ticks(ax: matplotlib.axes._axes.Axes, step: int)[source]

Show every ith x axis tick.

Parameters
  • ax – axis to be adjusted

  • step – factor to reduce the number of x axis ticks by

Examples

>>> fig, ax = plt.subplots()
>>> reduce_xaxis_ticks(ax, 5)

prestools.graph.reduce_yaxis_ticks(ax: matplotlib.axes._axes.Axes, step: int)[source]

Show every ith y axis tick.

Parameters
  • ax – axis to be adjusted

  • step – factor to reduce the number of y axis ticks by

Examples

>>> fig, ax = plt.subplots()
>>> reduce_yaxis_ticks(ax, 5)

prestools.misc

prestools.misc.apply_parallel(df: pandas.core.frame.DataFrame, function: Callable, cores: int = 4) → pandas.core.frame.DataFrame[source]

Apply a function to a dataframe in parallel.

Apply the given function to the dataframe, using the given number of cores for computation. The dataframe is split into as many parts as there are cores, and the function is applied to each part separately; finally, the dataframe is reconstructed and returned.

Parameters
  • df – input dataframe

  • function – function to apply

  • cores – number of cores to use (default: 4)

Returns

resulting dataframe

Return type

df

prestools.misc.benchmark(function: Callable) → Callable[source]

Benchmark a given function.

Decorator to run the given function and return the function name and the amount of time spent in executing it.

Parameters

function – function to benchmark
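
A timing decorator of this kind is commonly written along the following lines. In this sketch the function name and elapsed time are printed and the wrapped function's own result is returned unchanged; how prestools actually reports the timing is not specified here, so that behaviour, like the name benchmark_sketch, is an assumption.

```python
import functools
import time
from typing import Any, Callable


def benchmark_sketch(function: Callable) -> Callable:
    """Decorator that reports how long each call to `function` takes."""

    @functools.wraps(function)  # keep the wrapped function's name and docstring
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        result = function(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{function.__name__}: {elapsed:.6f} s")
        return result

    return wrapper


@benchmark_sketch
def slow_sum(n: int) -> int:
    """Sum the integers below n."""
    return sum(range(n))


total = slow_sum(1_000_000)  # prints the name and the elapsed time
```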

prestools.misc.equal_files(file1: str, file2: str) → bool[source]

Check whether two files are identical.

First check whether the files have the same size; if so, read them and check their content for equality.

Parameters
  • file1 – first file to compare

  • file2 – second file to compare
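
The two-step check described above (size first, content second) can be sketched as follows. Reading in fixed-size binary chunks is a choice made for this example, and the name equal_files_sketch and the chunk_size parameter are hypothetical; prestools may read the files differently.

```python
import os
import tempfile


def equal_files_sketch(file1: str, file2: str, chunk_size: int = 65536) -> bool:
    """Return True if both files have the same size and the same bytes."""
    # Cheap check first: files of different size cannot be identical.
    if os.path.getsize(file1) != os.path.getsize(file2):
        return False
    with open(file1, "rb") as f1, open(file2, "rb") as f2:
        while True:
            chunk1 = f1.read(chunk_size)
            chunk2 = f2.read(chunk_size)
            if chunk1 != chunk2:
                return False
            if not chunk1:  # both files exhausted without a mismatch
                return True


# Usage on three small temporary files.
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, text in [("a.txt", "ACGT\n"), ("b.txt", "ACGT\n"), ("c.txt", "ACGA\n")]:
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], "w") as handle:
            handle.write(text)
    same = equal_files_sketch(paths["a.txt"], paths["b.txt"])
    different = equal_files_sketch(paths["a.txt"], paths["c.txt"])

print(same, different)  # True False
```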

prestools.misc.filter_type(input_list: List[Any], target_type: Type) → List[Any][source]

Only keep elements of a given type from a list of elements.

Traverse a list and return a new list with only elements of the original list belonging to a given type.

Parameters
  • input_list – input list to filter

  • target_type – desired type to keep

Returns

filtered list

Return type

filtered
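
One way to express this filter is a list comprehension over isinstance checks, as sketched below. Using isinstance means subclasses are kept too (for example, True passes an int filter because bool is a subclass of int); whether prestools uses isinstance or an exact type comparison is not stated, so treat this, and the name filter_type_sketch, as assumptions.

```python
from typing import Any, List, Type


def filter_type_sketch(input_list: List[Any], target_type: Type) -> List[Any]:
    """Keep only the elements that are instances of target_type."""
    return [item for item in input_list if isinstance(item, target_type)]


mixed = [1, "a", 2.0, "b", None]
print(filter_type_sketch(mixed, str))    # ['a', 'b']
print(filter_type_sketch(mixed, float))  # [2.0]
```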

prestools.misc.flatten(iterable: Iterable, drop_null: bool = False) → List[Any][source]

Flatten out a nested iterable.

Flatten a nested iterable, even with multiple nesting levels and different data types. It is also possible to drop null values (None) from the resulting list.

Parameters
  • iterable – nested iterable to flatten

  • drop_null – filter out None from the flattened list (default: False)

Returns

flat list
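
A recursive sketch of this kind of flattening is shown below. It treats strings and bytes as atomic values rather than iterables of characters, which is an assumption for this example, as is the name flatten_sketch.

```python
from collections.abc import Iterable
from typing import Any, List


def flatten_sketch(iterable: Iterable, drop_null: bool = False) -> List[Any]:
    """Flatten arbitrarily nested iterables into a single list."""
    flat: List[Any] = []
    for item in iterable:
        # Assumption: strings and bytes are kept whole, not split into characters.
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            flat.extend(flatten_sketch(item, drop_null))
        elif item is None and drop_null:
            continue
        else:
            flat.append(item)
    return flat


nested = [1, [2, (3, None)], "ab", [[4]]]
print(flatten_sketch(nested))                  # [1, 2, 3, None, 'ab', 4]
print(flatten_sketch(nested, drop_null=True))  # [1, 2, 3, 'ab', 4]
```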

prestools.misc.invert_dict(input_dict: dict, sort_keys: bool = False) → dict[source]

Create a new dictionary swapping keys and values.

Invert a given dictionary, creating a new dictionary where each key is created from a value of the original dictionary, and its value is the key that it was associated with in the original dictionary (e.g. invert_dict({1: ["A", "E"], 2: ["D", "G"]}) = {"A": 1, "E": 1, "D": 2, "G": 2}). It is also possible to return an inverted dictionary with keys in alphabetical order. This makes little sense for intrinsically unordered data structures like dictionaries, but it may be useful when printing the results.

Parameters
  • input_dict – original dictionary to be inverted

  • sort_keys – sort the keys in the inverted dictionary in alphabetical order (default: False)

Returns

inverted dictionary

Return type

new_dict
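
The inversion in the example above can be sketched as follows. Treating a non-list value as a single element, and letting later keys overwrite earlier ones when a value occurs twice, are assumptions made for this illustration; the name invert_dict_sketch is hypothetical.

```python
def invert_dict_sketch(input_dict: dict, sort_keys: bool = False) -> dict:
    """Map every value (or list element) of input_dict back to its key."""
    new_dict = {}
    for key, values in input_dict.items():
        # Assumption: a value that is not a list, tuple or set is a single element.
        if not isinstance(values, (list, tuple, set)):
            values = [values]
        for value in values:
            new_dict[value] = key
    if sort_keys:
        new_dict = dict(sorted(new_dict.items()))
    return new_dict


original = {1: ["A", "E"], 2: ["D", "G"]}
print(invert_dict_sketch(original))
# {'A': 1, 'E': 1, 'D': 2, 'G': 2}
print(invert_dict_sketch(original, sort_keys=True))
# {'A': 1, 'D': 2, 'E': 1, 'G': 2}
```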

prestools.misc.prime_factors(number: int) → List[int][source]

Calculate the prime factors of a number.

Calculate the prime factors of a given natural number. Note that 1 is not a prime number, so it will not be included.

Parameters

number – input natural number

Returns

list of prime factors

Return type

factors
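
Prime factorisation by trial division is one straightforward way to obtain such a list. The sketch below returns factors with multiplicity in ascending order and an empty list for 1; the name prime_factors_sketch and these output conventions are assumptions, not a description of the prestools algorithm.

```python
from typing import List


def prime_factors_sketch(number: int) -> List[int]:
    """Prime factors of a natural number, with multiplicity, by trial division."""
    if number < 1:
        raise ValueError("number must be a natural number")
    factors = []
    divisor = 2
    while divisor * divisor <= number:
        while number % divisor == 0:
            factors.append(divisor)
            number //= divisor
        divisor += 1
    if number > 1:
        # Whatever remains after dividing out small factors is itself prime.
        factors.append(number)
    return factors


print(prime_factors_sketch(360))  # [2, 2, 2, 3, 3, 5]
print(prime_factors_sketch(97))   # [97]
print(prime_factors_sketch(1))    # []
```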

prestools.misc.wordcount(sentence: str, word: Union[bool, str] = False, ignore_case: bool = False) → Union[dict, int][source]

Count occurrences of words in a sentence.

Return the number of occurrences of each word in the given sentence, in the form of a dictionary; it is also possible to directly return the number of occurrences of a specific word.

Parameters
  • sentence – input sentence to count words from

  • word – target word to count occurrences of

  • ignore_case – ignore case in the given sentence (default: False)

Returns

dictionary of word counts

Return type

word_dict
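
The two return modes (a dictionary of all counts, or a single integer for one word) might be sketched as below. Splitting on whitespace without stripping punctuation is an assumption for this example, as is the name wordcount_sketch.

```python
from collections import Counter
from typing import Union


def wordcount_sketch(
    sentence: str, word: Union[bool, str] = False, ignore_case: bool = False
) -> Union[dict, int]:
    """Count words in a sentence, or the occurrences of one target word."""
    if ignore_case:
        sentence = sentence.lower()
    # Assumption: words are separated by whitespace; punctuation is not stripped.
    counts = Counter(sentence.split())
    if word:
        target = word.lower() if ignore_case else word
        return counts[target]
    return dict(counts)


text = "the cat and The dog"
print(wordcount_sketch(text))
# {'the': 1, 'cat': 1, 'and': 1, 'The': 1, 'dog': 1}
print(wordcount_sketch(text, word="the", ignore_case=True))  # 2
```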


Command Line Interface

prestools bioinf

bioinf

Bioinformatics utilities

bioinf [OPTIONS] COMMAND [ARGS]...

hamming-distance

Hamming distance between two sequences

Calculate the Hamming distance between SEQ_1 and SEQ_2.

bioinf hamming-distance [OPTIONS] SEQ_1 SEQ_2

Options

-i, --ignore_case

Ignore case when comparing sequences (default: False)

Arguments

SEQ_1

Required argument

SEQ_2

Required argument

jukes-cantor-distance

Jukes-Cantor distance between two sequences

Return the Jukes-Cantor distance between SEQ_1 and SEQ_2, calculated as distance = -b log(1 - p/b) where b = 3/4 and p = p_distance.

bioinf jukes-cantor-distance [OPTIONS] SEQ_1 SEQ_2

Arguments

SEQ_1

Required argument

SEQ_2

Required argument

kimura-distance

Kimura 2-Parameter distance between two sequences

Return the Kimura 2-Parameter distance between SEQ_1 and SEQ_2, calculated as distance = -0.5 log((1 - 2p - q) * sqrt(1 - 2q)) where p = transition frequency and q = transversion frequency.

bioinf kimura-distance [OPTIONS] SEQ_1 SEQ_2

Arguments

SEQ_1

Required argument

SEQ_2

Required argument

p-distance

Pairwise distance between two sequences

Return the uncorrected distance between SEQ_1 and SEQ_2.

bioinf p-distance [OPTIONS] SEQ_1 SEQ_2

Arguments

SEQ_1

Required argument

SEQ_2

Required argument

random-sequence

Create a random sequence of the given length

Create a random sequence of the given LENGTH using the specified ALPHABET (nucleotides or amino acids).

bioinf random-sequence [OPTIONS] LENGTH

Options

-a, --alphabet <alphabet>

Character alphabet to use to create the sequence (‘nt’, ‘aa’) (default: ‘nt’)

Options

nt|aa

Arguments

LENGTH

Required argument

reverse-complement

Convert a nucleotide sequence into its reverse complement

Convert a nucleotide SEQUENCE into its reverse, complement or reverse complement.

bioinf reverse-complement [OPTIONS] SEQUENCE

Options

-c, --conversion <conversion>

Type of conversion to perform (‘r’|‘reverse’, ‘c’|‘complement’, ‘rc’|‘reverse_complement’) (default: ‘rc’|‘reverse_complement’)

Options

reverse|complement|reverse_complement|r|c|rc

Arguments

SEQUENCE

Required argument

shuffle-sequence

Shuffle the given sequence

Randomly shuffle a SEQUENCE, maintaining the same nucleotide composition.

bioinf shuffle-sequence [OPTIONS] SEQUENCE

Arguments

SEQUENCE

Required argument

tajima-nei-distance

Tajima-Nei distance between two sequences

Return the Tajima-Nei distance between SEQ_1 and SEQ_2, calculated as distance = -b log(1 - p / b), where:

  • b = 0.5 * [1 - Sum i from A to T (Gi^2) + p^2 / h]

  • h = Sum i from A to G (Sum j from C to T (Xij^2 / (2 * Gi * Gj)))

  • p = p-distance

  • Xij = frequency of pair (i,j) in SEQ_1 and SEQ_2, with gaps removed

  • Gi = frequency of base i over SEQ_1 and SEQ_2

bioinf tajima-nei-distance [OPTIONS] SEQ_1 SEQ_2

Arguments

SEQ_1

Required argument

SEQ_2

Required argument

tamura-distance

Tamura distance between two sequences

Return the Tamura distance between SEQ_1 and SEQ_2, calculated as distance = -C log(1 - P/C - Q) - 0.5(1 - C)log(1 - 2Q), where:

  • P = transition frequency

  • Q = transversion frequency

  • C = GC1 + GC2 - 2 * GC1 * GC2

  • GC1 = GC-content of SEQ_1

  • GC2 = GC-content of SEQ_2

bioinf tamura-distance [OPTIONS] SEQ_1 SEQ_2

Arguments

SEQ_1

Required argument

SEQ_2

Required argument

prestools clustering

clustering

Data clustering utilities

clustering [OPTIONS] COMMAND [ARGS]...

find-n-clusters-elbow

Find the number of clusters using the elbow method

Find the suggested number of clusters for the given dataframe of correlations, using the elbow method.

clustering find-n-clusters-elbow [OPTIONS] DF

Options

-m, --method <method>

Method to be used to cluster the data [‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’] (default = ‘ward’)

Options

ward|single|complete|average|weighted|centroid|median

Arguments

DF

Required argument

prestools misc

misc

Miscellaneous utilities

misc [OPTIONS] COMMAND [ARGS]...

equal-files

Check whether two files are identical

First check whether FILE1 and FILE2 have the same size; if so, read them and check their content for equality.

misc equal-files [OPTIONS] FILE1 FILE2

Arguments

FILE1

Required argument

FILE2

Required argument

prime-factors

Calculate the prime factors of a number

Calculate the prime factors of a given natural NUMBER. Note that 1 is not a prime number, so it will not be included.

misc prime-factors [OPTIONS] NUMBER

Arguments

NUMBER

Required argument

wordcount

Count occurrences of words in a sentence

Return the number of occurrences of each word in the given SENTENCE, in the form of a dictionary; it is also possible to directly return the number of occurrences of a specific WORD.

misc wordcount [OPTIONS] SENTENCE

Options

-w, --word <word>

Target word to count occurrences of

-i, --ignore_case

Ignore case in the given sentence (default: False)

Arguments

SENTENCE

Required argument