API¶

All prestools functionality is available after importing the desired module into Python, for example import prestools.bioinf as pb. Some functions are also available as Command-Line Interface commands, but the CLI covers only a subset of the functions and should not be relied on.

Python module functions¶

prestools.bioinf¶

prestools.bioinf.aa_one_to_three(sequence: str) → str[source]¶

Convert one-letter amino acid code to three-letter code.

Parameters: sequence – sequence of amino acids in one-letter code
Returns: sequence converted to three-letter code
Return type: str

prestools.bioinf.aa_three_to_one(sequence: str) → str[source]¶

Convert three-letter amino acid code to one-letter code.

Parameters: sequence – sequence of amino acids in three-letter code
Returns: sequence converted to one-letter code
Return type: str
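
The two conversions above can be illustrated with a lookup table. The following is a minimal sketch, not the prestools implementation: the names one_to_three_sketch and three_to_one_sketch are hypothetical, only the 20 standard amino acids are covered, and the output format (capitalised codes joined without a separator) is an assumption.

```python
# Standard one-letter to three-letter amino acid codes (20 standard residues).
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def one_to_three_sketch(sequence: str) -> str:
    """Convert one-letter codes to concatenated three-letter codes."""
    # Assumption: unknown residues raise KeyError; prestools may behave differently.
    return "".join(ONE_TO_THREE[aa] for aa in sequence.upper())


def three_to_one_sketch(sequence: str) -> str:
    """Convert concatenated three-letter codes to one-letter codes."""
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Split into chunks of three characters and normalise the case.
    codes = [sequence[i:i + 3].capitalize() for i in range(0, len(sequence), 3)]
    return "".join(THREE_TO_ONE[code] for code in codes)
```

Under these assumptions the two functions are inverses of each other for sequences made of standard residues.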

prestools.bioinf.hamming_distance(seq_1: str, seq_2: str, ignore_case: bool = False) → int[source]¶

Calculate the Hamming distance between two sequences.

Parameters

seq_1 – first sequence to compare
seq_2 – second sequence to compare
ignore_case – ignore case when comparing sequences (default: False)

Returns

Hamming distance

Return type

int
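
As an illustration of the definition, here is a minimal sketch in plain Python. It is not the prestools implementation: the name hamming_sketch is hypothetical, and raising ValueError for sequences of different length is an assumption.

```python
def hamming_sketch(seq_1: str, seq_2: str, ignore_case: bool = False) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_1) != len(seq_2):
        # Assumption: unequal lengths are rejected; prestools may differ.
        raise ValueError("sequences must have the same length")
    if ignore_case:
        seq_1, seq_2 = seq_1.upper(), seq_2.upper()
    # True counts as 1 when summed, so this counts the mismatches.
    return sum(a != b for a, b in zip(seq_1, seq_2))
```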

prestools.bioinf.jukes_cantor_distance(seq_1: str, seq_2: str) → float[source]¶

Calculate the Jukes-Cantor distance between two sequences.

Return the Jukes-Cantor distance between seq_1 and seq_2, calculated as distance = -b log(1 - p/b) where b = 3/4 and p = p_distance.

Parameters

seq_1 – first sequence to compare
seq_2 – second sequence to compare

Returns

Jukes-Cantor distance

Return type

float
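
The formula above can be sketched directly in plain Python. This is an illustration under stated assumptions, not the prestools implementation: the function names are hypothetical, p is taken as the fraction of aligned positions that differ, log is the natural logarithm, and the sketch raises ValueError where the formula is undefined (p >= 3/4).

```python
import math


def p_distance_sketch(seq_1: str, seq_2: str) -> float:
    """Fraction of aligned positions at which the sequences differ."""
    if len(seq_1) != len(seq_2) or not seq_1:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a != b for a, b in zip(seq_1, seq_2)) / len(seq_1)


def jukes_cantor_sketch(seq_1: str, seq_2: str) -> float:
    """Jukes-Cantor distance: -b * ln(1 - p / b) with b = 3/4."""
    b = 3 / 4
    p = p_distance_sketch(seq_1, seq_2)
    if p >= b:
        # The logarithm argument would be zero or negative.
        raise ValueError("distance is undefined for p >= 3/4")
    return -b * math.log(1 - p / b)
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.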

prestools.bioinf.kimura_distance(seq_1: str, seq_2: str) → float[source]¶

Calculate the Kimura 2-Parameter distance between two sequences.

Return the Kimura 2-Parameter distance between seq_1 and seq_2, calculated as distance = -0.5 log((1 - 2p - q) * sqrt(1 - 2q)) where p = transition frequency and q = transversion frequency.

Parameters

seq_1 – first sequence to compare
seq_2 – second sequence to compare

Returns

Kimura distance

Return type

float
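
To make the roles of p and q concrete, the following sketch counts transitions (A<->G, C<->T) and transversions (all other substitutions) and applies the formula. It is an illustration only, not the prestools implementation: the name kimura_sketch is hypothetical, and the inputs are assumed to be aligned, gap-free sequences containing only A, C, G and T.

```python
import math

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def kimura_sketch(seq_1: str, seq_2: str) -> float:
    """Kimura 2-Parameter distance from transition/transversion frequencies."""
    seq_1, seq_2 = seq_1.upper(), seq_2.upper()
    if len(seq_1) != len(seq_2) or not seq_1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq_1, seq_2):
        if a == b:
            continue
        # A substitution within purines or within pyrimidines is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq_1)
    q = transversions / len(seq_1)
    # math.log / math.sqrt raise ValueError when the sequences are too divergent.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```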

prestools.bioinf.mutate_sequence(sequence: str, mutations: int = 1, alphabet: str = 'nt') → str[source]¶

Mutate a sequence introducing a given number of mutations.

Introduce a specific number of mutations into the given sequence.

Parameters

sequence – input sequence to mutate
mutations – number of mutations to introduce (default: 1)
alphabet – character alphabet to use (‘nt’, ‘aa’) (default: ‘nt’)

Returns

mutated sequence

Return type

str

prestools.bioinf.nt_frequency(sequence: str) → Dict[str, float][source]¶

Calculate nucleotide frequencies.

Return a dictionary with nucleotide frequencies from the given sequence.

Parameters: sequence – input nucleotide sequence
Returns: dictionary of nucleotide frequencies
Return type: Dict[str, float]
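
A frequency dictionary of this kind can be sketched with collections.Counter. This is not the prestools implementation: the name nt_frequency_sketch is hypothetical, and restricting the keys to A, C, G and T, upper-casing the input, and dividing by the full sequence length are all assumptions.

```python
from collections import Counter
from typing import Dict


def nt_frequency_sketch(sequence: str) -> Dict[str, float]:
    """Relative frequency of each of A, C, G, T in the sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence.upper())
    # Assumption: frequencies are relative to the total sequence length.
    return {nt: counts[nt] / len(sequence) for nt in "ACGT"}
```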

prestools.bioinf.p_distance(seq_1: str, seq_2: str) → float[source]¶

Calculate the pairwise distance between two sequences.

Return the uncorrected distance between seq_1 and seq_2.

Parameters

seq_1 – first sequence to compare
seq_2 – second sequence to compare

Returns

pairwise distance

Return type

float

prestools.bioinf.quantile_norm(x: numpy.ndarray, to_log: bool = False) → numpy.ndarray[source]¶

Normalize the columns of x to each have the same distribution.

Given an expression matrix (microarray data, read counts, etc) of M genes by N samples, quantile normalization ensures all samples have the same spread of data (by construction).

The data across each row are averaged to obtain an average column. Each column quantile is replaced with the corresponding quantile of the average column.

Parameters

x – array of input data, of shape (N_genes, N_samples)
to_log – log-transform the data before normalising (default: False)

Returns

array of normalised data, of shape (N_genes, N_samples)

Return type

numpy.ndarray
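
The procedure can be sketched without NumPy to show the individual steps: sort each column, average the sorted columns rank by rank, then give each value the average for its rank. This is an illustration, not the prestools implementation: the name quantile_norm_sketch is hypothetical, the input is a list of rows, the to_log option is omitted, and ties are broken by row order (an assumption).

```python
from typing import List


def quantile_norm_sketch(x: List[List[float]]) -> List[List[float]]:
    """Quantile-normalise a genes-by-samples matrix given as a list of rows."""
    n_rows, n_cols = len(x), len(x[0])
    columns = [[row[j] for row in x] for j in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    # Average of the i-th smallest value across all samples.
    mean_quantiles = [
        sum(col[i] for col in sorted_cols) / n_cols for i in range(n_rows)
    ]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Row indices of this column ordered from smallest to largest value.
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = mean_quantiles[rank]
    return result
```

After normalisation every column contains the same set of values, only in a different order.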

prestools.bioinf.random_sequence(length: Union[int, str], alphabet: str = 'nt') → str[source]¶

Create a random sequence of the given length.

Create a random sequence of the given length using the specified alphabet (nucleotides or amino acids).

Parameters

length – desired length of the random sequence
alphabet – character alphabet to use (‘nt’, ‘aa’) (default: ‘nt’)

Returns

new random sequence

Return type

str

prestools.bioinf.reverse_complement(sequence: str, conversion: str = 'reverse_complement') → str[source]¶

Convert a nucleotide sequence into its reverse complement.

Convert a nucleotide sequence into its reverse, complement or reverse complement.

Parameters

sequence – nucleotide sequence to be converted
conversion – type of conversion to perform (‘r’|‘reverse’, ‘c’|‘complement’, ‘rc’|‘reverse_complement’) (default: ‘rc’|‘reverse_complement’)

Returns

converted sequence
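
The three conversion modes can be sketched with a translation table. This is not the prestools implementation: the name convert_sketch is hypothetical, only A, C, G and T (in either case) are complemented, and raising ValueError for an unknown conversion is an assumption.

```python
# Translation table mapping each base to its complement, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def convert_sketch(sequence: str, conversion: str = "reverse_complement") -> str:
    """Return the reverse, complement, or reverse complement of a sequence."""
    if conversion in ("r", "reverse"):
        return sequence[::-1]
    if conversion in ("c", "complement"):
        return sequence.translate(_COMPLEMENT)
    if conversion in ("rc", "reverse_complement"):
        return sequence.translate(_COMPLEMENT)[::-1]
    raise ValueError(f"unknown conversion: {conversion!r}")
```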

prestools.bioinf.rpkm(counts: numpy.ndarray, lengths: numpy.ndarray) → numpy.ndarray[source]¶

Calculate reads per kilobase transcript per million reads.

RPKM = (10^9 * C) / (N * L)

Where:

C = Number of reads mapped to a gene
N = Total mapped reads in the experiment
L = Exon length in base pairs for a gene

Parameters

counts – count data where columns are individual samples and rows are genes, of shape (N_genes, N_samples)
lengths – gene lengths in base pairs in the same order as the rows in counts, of shape (N_genes, )

Returns

RPKM normalized counts matrix, of shape (N_genes, N_samples)

Return type

numpy.ndarray
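
As a worked instance of the formula, the sketch below applies it to plain Python lists. It is an illustration, not the prestools implementation: the name rpkm_sketch is hypothetical, and N is assumed to be the column sum of the counts matrix, that is, the total mapped reads of each sample.

```python
from typing import List


def rpkm_sketch(counts: List[List[float]], lengths: List[float]) -> List[List[float]]:
    """RPKM = 1e9 * C / (N * L), computed per gene (row) and sample (column)."""
    n_samples = len(counts[0])
    # N: total mapped reads per sample (assumed to be the column sums).
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [1e9 * c / (totals[j] * length) for j, c in enumerate(row)]
        for row, length in zip(counts, lengths)
    ]
```

For a single sample of one million reads split evenly between a 1000 bp gene and a 2000 bp gene, the longer gene receives half the RPKM of the shorter one.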

prestools.bioinf.shuffle_sequence(sequence: str) → str[source]¶

Shuffle the given sequence.

Randomly shuffle a sequence, maintaining the same composition.

Parameters: sequence – input sequence to shuffle
Returns: shuffled sequence
Return type: str
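
A composition-preserving shuffle can be sketched with the standard random module. This is not the prestools implementation: the name shuffle_sketch and the seed parameter are hypothetical, the latter added only to make the example reproducible.

```python
import random
from typing import Optional


def shuffle_sketch(sequence: str, seed: Optional[int] = None) -> str:
    """Return the characters of sequence in random order."""
    rng = random.Random(seed)  # hypothetical seed for reproducibility
    chars = list(sequence)
    rng.shuffle(chars)  # in-place shuffle of the character list
    return "".join(chars)
```

Because only the order changes, the shuffled sequence has exactly the same character counts as the input.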

prestools.bioinf.tajima_nei_distance(seq_1: str, seq_2: str) → float[source]¶

Calculate the Tajima-Nei distance between two sequences.

Return the Tajima-Nei distance between seq_1 and seq_2, calculated as distance = -b log(1 - p / b), where:

b = 0.5 * [1 - Sum i from A to T (Gi^2) + p^2 / h]
h = Sum i from A to G (Sum j from C to T (Xij^2 / (2 * Gi * Gj)))
p = p-distance
Xij = frequency of pair (i, j) in seq_1 and seq_2, with gaps removed
Gi = frequency of base i over seq_1 and seq_2

Parameters

seq_1 – first sequence to compare
seq_2 – second sequence to compare

Returns

Tajima-Nei distance

Return type

float

prestools.bioinf.tamura_distance(seq_1: str, seq_2: str) → float[source]¶

Calculate the Tamura distance between two sequences.

Return the Tamura distance between seq_1 and seq_2, calculated as distance = -C log(1 - P/C - Q) - 0.5(1 - C)log(1 - 2Q), where:

P = transition frequency
Q = transversion frequency
C = GC1 + GC2 - 2 * GC1 * GC2
GC1 = GC-content of seq_1
GC2 = GC-content of seq_2

Parameters

seq_1 – first sequence to compare
seq_2 – second sequence to compare

Returns

Tamura distance

Return type

float
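
The Tamura formula differs from the Kimura one by the GC-content term C. The sketch below shows how the pieces fit together. It is an illustration, not the prestools implementation: the names tamura_sketch and gc_content are hypothetical, and the inputs are assumed to be aligned, gap-free sequences containing only A, C, G and T.

```python
import math

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in an upper-case sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def tamura_sketch(seq_1: str, seq_2: str) -> float:
    """Tamura distance from transition, transversion and GC frequencies."""
    seq_1, seq_2 = seq_1.upper(), seq_2.upper()
    if len(seq_1) != len(seq_2) or not seq_1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq_1, seq_2):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq_1)
    q = transversions / len(seq_1)
    gc1, gc2 = gc_content(seq_1), gc_content(seq_2)
    c = gc1 + gc2 - 2 * gc1 * gc2
    # c is zero when both sequences are all-GC or all-AT; the division then fails.
    return -c * math.log(1 - p / c - q) - 0.5 * (1 - c) * math.log(1 - 2 * q)
```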

prestools.clustering¶

prestools.clustering.find_n_clusters_elbow(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], plot: bool = False, method: str = 'ward') → Union[int, None, ValueError][source]¶

Find the suggested number of clusters using the elbow method.

Find the suggested number of clusters for the given dataframe of correlations, using the elbow method.

Parameters

df – input dataframe of correlations
plot – plot the resulting elbow plot (default: False)
method – method to use to cluster the data (‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’) (default: ‘ward’)

Returns

number of clusters found

Return type

int

prestools.clustering.hierarchical_clustering(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], method: str = 'ward') → Union[prestools.classes.HierCluster, None, ValueError][source]¶

Perform hierarchical clustering of a dataframe.

Return clustering created using scipy from a given dataframe of correlations, using the HierCluster class available in prestools.classes.

Parameters

df – input dataframe of correlations
method – method to use to cluster the data (‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’) (default: ‘ward’)

Returns

instance of prestools.classes.HierCluster()

Return type

prestools.classes.HierCluster

prestools.graph¶

prestools.graph.flatten_image(img: numpy.ndarray, scale: bool = False) → numpy.ndarray[source]¶

Convert an image array to a single-dimension vector.

Parameters

img – input image array of shape (l, h, d = 3)
scale – scale resulting vector dividing its values by 255 (default: False)

Returns

reshaped vector of shape (l * h * d, 1)

Return type

numpy.ndarray

prestools.graph.plot_confusion_matrix(cm: numpy.ndarray, class_names: List[str], title: str = 'Confusion Matrix', cmap: str = 'Reds', normalize: bool = False, save: Union[bool, str] = False)[source]¶

Create a plot from a confusion matrix array.

Parameters

cm – input confusion matrix array
class_names – class names to use
title – title for resulting plot (default: ‘Confusion Matrix’)
cmap – colormap to use (default: ‘Reds’)
normalize – use classes ratios instead of raw numbers (default: False)
save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)

prestools.graph.plot_dendrogram(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], cut_off: Union[bool, float] = False, title: str = 'Dendrogram', save: Union[bool, str] = False, method: str = 'ward')[source]¶

Plot a dendrogram from a dataframe.

Create (and optionally save) a dendrogram plot starting from a given dataframe of correlations. It is also possible to add a cut-off line given a distance to use for separating clusters.

Parameters

df – input dataframe of correlations
cut_off – if not False, the distance at which a vertical cut-off line is added to help identify clusters (default: False)
title – title for resulting plot (default: ‘Dendrogram’)
save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)
method – method to use to cluster the data (default: ‘ward’)

prestools.graph.plot_heatmap_dendrogram(df: pandas.core.frame.DataFrame, cmap: str = 'RdBu_r', title: str = 'Cluster Heatmap', save: Union[bool, str] = False, method: str = 'ward')[source]¶

Plot a heatmap with hierarchical clustering of a dataframe.

Create (and optionally save) a heatmap with hierarchical clustering created using Seaborn, starting from a given dataframe of correlations.

Parameters

df – input dataframe of correlations
cmap – colormap to use (default: ‘RdBu_r’)
title – title for resulting plot (default: ‘Cluster Heatmap’)
save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)
method – method to use to cluster the data (default: ‘ward’)

prestools.graph.reduce_xaxis_ticks(ax: matplotlib.axes._axes.Axes, step: int)[source]¶

Show every ith x axis tick.

Parameters

ax – axis to be adjusted
step – factor to reduce the number of x axis ticks by

Examples

>>> fig, ax = plt.subplots()
>>> reduce_xaxis_ticks(ax, 5)

prestools.graph.reduce_yaxis_ticks(ax: matplotlib.axes._axes.Axes, step: int)[source]¶

Show every ith y axis tick.

Parameters

ax – axis to be adjusted
step – factor to reduce the number of y axis ticks by

Examples

>>> fig, ax = plt.subplots()
>>> reduce_yaxis_ticks(ax, 5)

prestools.misc¶

prestools.misc.apply_parallel(df: pandas.core.frame.DataFrame, function: Callable, cores: int = 4) → pandas.core.frame.DataFrame[source]¶

Apply a function to a dataframe in parallel.

Apply the given function to the dataframe, using the given number of cores for computation. The dataframe is split into as many parts as there are cores, and the function is applied to each part separately; finally, the dataframe is reconstructed and returned.

Parameters

df – input dataframe
function – function to apply
cores – number of cores to use (default: 4)

Returns

resulting dataframe

Return type

pandas.core.frame.DataFrame

prestools.misc.benchmark(function: Callable) → Callable[source]¶

Benchmark a given function.

Decorator to run the given function and return the function name and the amount of time spent executing it.

Parameters: function – function to benchmark
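
A timing decorator of this kind can be sketched as follows. This is not the prestools implementation: the name benchmark_sketch is hypothetical, and printing the timing while still returning the wrapped function's own result is an assumption about the behaviour.

```python
import functools
import time
from typing import Callable


def benchmark_sketch(function: Callable) -> Callable:
    """Decorator that reports how long each call to function takes."""
    @functools.wraps(function)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = function(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # Assumption: the report is printed rather than returned.
        print(f"{function.__name__}: {elapsed:.6f} s")
        return result
    return wrapper


@benchmark_sketch
def sum_to(n: int) -> int:
    """Example workload: sum the integers below n."""
    return sum(range(n))
```

Calling sum_to(10) prints one timing line and returns the sum as usual.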

prestools.misc.equal_files(file1: str, file2: str) → bool[source]¶

Check whether two files are identical.

First check whether the files have the same size; if so, read them and compare their contents.

Parameters

file1 – first file to compare
file2 – second file to compare
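
The size-then-content strategy described above can be sketched with the standard library. This is not the prestools implementation: the name equal_files_sketch and the chunk_size parameter are hypothetical, and reading in binary chunks rather than loading whole files is a design choice of the sketch.

```python
import os


def equal_files_sketch(file1: str, file2: str, chunk_size: int = 65536) -> bool:
    """Return True if both files have identical size and content."""
    # Cheap check first: files of different size cannot be identical.
    if os.path.getsize(file1) != os.path.getsize(file2):
        return False
    with open(file1, "rb") as f1, open(file2, "rb") as f2:
        while True:
            chunk1 = f1.read(chunk_size)
            chunk2 = f2.read(chunk_size)
            if chunk1 != chunk2:
                return False
            if not chunk1:  # both files are exhausted
                return True
```

Comparing sizes first avoids reading any data in the common case where the files differ in length.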

prestools.misc.filter_type(input_list: List[Any], target_type: Type) → List[Any][source]¶

Only keep elements of a given type from a list of elements.

Traverse a list and return a new list with only elements of the original list belonging to a given type.

Parameters

input_list – input list to filter
target_type – desired type to keep

Returns

filtered list

Return type

List[Any]
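
A filter of this kind is a one-line comprehension. The sketch below is not the prestools implementation: the name filter_type_sketch is hypothetical, and using isinstance (which also accepts subclasses, so True and False pass an int filter) is an assumption.

```python
from typing import Any, List, Type


def filter_type_sketch(input_list: List[Any], target_type: Type) -> List[Any]:
    """Keep only the elements that are instances of target_type."""
    # Assumption: subclasses count as matches because isinstance is used.
    return [item for item in input_list if isinstance(item, target_type)]
```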

prestools.misc.flatten(iterable: Iterable, drop_null: bool = False) → List[Any][source]¶

Flatten out a nested iterable.

Flatten a nested iterable, even with multiple nesting levels and different data types. It is also possible to drop null values (None) from the resulting list.

Parameters

iterable – nested iterable to flatten
drop_null – filter out None from the flattened list (default: False)

Returns

flat list
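
Flattening arbitrary nesting is naturally recursive. The sketch below is not the prestools implementation: the name flatten_sketch is hypothetical, and treating strings and bytes as atomic values rather than iterables of characters is an assumption.

```python
from collections.abc import Iterable
from typing import Any, List


def flatten_sketch(iterable: Iterable, drop_null: bool = False) -> List[Any]:
    """Flatten nested iterables into a single list."""
    flat: List[Any] = []
    for item in iterable:
        # Assumption: str and bytes are kept whole, not split into characters.
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            flat.extend(flatten_sketch(item, drop_null))
        else:
            flat.append(item)
    if drop_null:
        flat = [item for item in flat if item is not None]
    return flat
```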

prestools.misc.invert_dict(input_dict: dict, sort_keys: bool = False) → dict[source]¶

Create a new dictionary swapping keys and values.

Invert a given dictionary, creating a new dictionary in which each value of the original dictionary becomes a key, mapped to the key it was associated with in the original dictionary (e.g. invert_dict({1: ["A", "E"], 2: ["D", "G"]}) = {"A": 1, "E": 1, "D": 2, "G": 2}). The keys of the inverted dictionary can optionally be sorted alphabetically; this matters little for lookups, but it may be useful when printing the results.

Parameters

input_dict – original dictionary to be inverted
sort_keys – sort the keys in the inverted dictionary in alphabetical order (default: False)

Returns

inverted dictionary

Return type

dict
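
The inversion described above, including values that are lists, can be sketched as follows. This is not the prestools implementation: the name invert_dict_sketch is hypothetical, and treating list, tuple and set values as collections of new keys (while any other value becomes a single key) is an assumption.

```python
def invert_dict_sketch(input_dict: dict, sort_keys: bool = False) -> dict:
    """Swap keys and values, expanding collection values into several keys."""
    new_dict = {}
    for key, value in input_dict.items():
        if isinstance(value, (list, tuple, set)):
            # Each element of the collection becomes its own key.
            for element in value:
                new_dict[element] = key
        else:
            new_dict[value] = key
    if sort_keys:
        # Python dictionaries keep insertion order, so re-inserting sorts them.
        new_dict = dict(sorted(new_dict.items()))
    return new_dict
```

If the same value appears under several original keys, the last one wins in this sketch.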

prestools.misc.prime_factors(number: int) → List[int][source]¶

Calculate the prime factors of a number.

Calculate the prime factors of a given natural number. Note that 1 is not a prime number, so it will not be included.

Parameters: number – input natural number
Returns: list of prime factors
Return type: List[int]
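
Prime factorisation of small numbers can be sketched with trial division. This is not the prestools implementation: the name prime_factors_sketch is hypothetical, and returning repeated factors with their multiplicity, in ascending order, is an assumption.

```python
from typing import List


def prime_factors_sketch(number: int) -> List[int]:
    """Prime factors of a natural number, with multiplicity."""
    if number < 1:
        raise ValueError("number must be a natural number")
    factors = []
    divisor = 2
    # Any composite number has a factor no larger than its square root.
    while divisor * divisor <= number:
        while number % divisor == 0:
            factors.append(divisor)
            number //= divisor
        divisor += 1
    if number > 1:
        # Whatever remains is itself prime.
        factors.append(number)
    return factors
```

As the description notes, 1 has no prime factors, so the sketch returns an empty list for it.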

prestools.misc.wordcount(sentence: str, word: Union[bool, str] = False, ignore_case: bool = False) → Union[dict, int][source]¶

Count occurrences of words in a sentence.

Return the number of occurrences of each word in the given sentence, in the form of a dictionary; it is also possible to directly return the number of occurrences of a specific word.

Parameters

sentence – input sentence to count words from
word – target word to count occurrences of (default: False)
ignore_case – ignore case in the given sentence (default: False)

Returns

dictionary of word counts, or the number of occurrences of word when it is given

Return type

Union[dict, int]
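
The two return modes can be sketched with collections.Counter. This is not the prestools implementation: the name wordcount_sketch is hypothetical, words are assumed to be separated by whitespace with punctuation left attached, and lower-casing the target word together with the sentence is an assumption.

```python
from collections import Counter
from typing import Union


def wordcount_sketch(
    sentence: str, word: Union[bool, str] = False, ignore_case: bool = False
) -> Union[dict, int]:
    """Count all words, or only the given word, in a sentence."""
    if ignore_case:
        sentence = sentence.lower()
        if word:
            word = word.lower()  # assumption: the target is normalised too
    counts = Counter(sentence.split())  # split on whitespace only
    if word:
        return counts[word]  # Counter returns 0 for missing words
    return dict(counts)
```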

Command Line Interface¶

prestools bioinf¶

bioinf¶

Bioinformatics utilities

bioinf [OPTIONS] COMMAND [ARGS]...

hamming-distance¶

Hamming distance between two sequences

Calculate the Hamming distance between SEQ_1 and SEQ_2.

bioinf hamming-distance [OPTIONS] SEQ_1 SEQ_2

Options

-i, --ignore_case¶: Ignore case when comparing sequences (default: False)

Arguments

SEQ_1¶: Required argument

SEQ_2¶: Required argument

jukes-cantor-distance¶

Jukes-Cantor distance between two sequences

Return the Jukes-Cantor distance between SEQ_1 and SEQ_2, calculated as distance = -b log(1 - p/b) where b = 3/4 and p = p_distance.

bioinf jukes-cantor-distance [OPTIONS] SEQ_1 SEQ_2

Arguments

SEQ_1¶: Required argument

SEQ_2¶: Required argument

kimura-distance¶

Kimura 2-Parameter distance between two sequences

Return the Kimura 2-Parameter distance between SEQ_1 and SEQ_2, calculated as distance = -0.5 log((1 - 2p - q) * sqrt(1 - 2q)) where p = transition frequency and q = transversion frequency.

bioinf kimura-distance [OPTIONS] SEQ_1 SEQ_2

Arguments

SEQ_1¶: Required argument

SEQ_2¶: Required argument

p-distance¶

Pairwise distance between two sequences

Return the uncorrected distance between SEQ_1 and SEQ_2.

bioinf p-distance [OPTIONS] SEQ_1 SEQ_2

Arguments

SEQ_1¶: Required argument

SEQ_2¶: Required argument

random-sequence¶

Create a random sequence of the given length

Create a random sequence of the given LENGTH using the specified ALPHABET (nucleotides or amino acids).

bioinf random-sequence [OPTIONS] LENGTH

Options

-a, --alphabet <alphabet>¶

Character alphabet to use to create the sequence (‘nt’, ‘aa’) (default: ‘nt’)

Options: nt|aa

Arguments

LENGTH¶: Required argument

reverse-complement¶

Convert a nucleotide sequence into its reverse complement

Convert a nucleotide SEQUENCE into its reverse, complement or reverse complement.

bioinf reverse-complement [OPTIONS] SEQUENCE

Options

-c, --conversion <conversion>¶

Type of conversion to perform (‘r’|‘reverse’, ‘c’|‘complement’, ‘rc’|‘reverse_complement’) (default: ‘rc’|‘reverse_complement’)

Options: reverse|complement|reverse_complement|r|c|rc

Arguments

SEQUENCE¶: Required argument

shuffle-sequence¶

Shuffle the given sequence

Randomly shuffle a SEQUENCE, maintaining the same nucleotide composition.

bioinf shuffle-sequence [OPTIONS] SEQUENCE

Arguments

SEQUENCE¶: Required argument

tajima-nei-distance¶

Tajima-Nei distance between two sequences

Return the Tajima-Nei distance between SEQ_1 and SEQ_2, calculated as distance = -b log(1 - p / b), where:

b = 0.5 * [1 - Sum i from A to T (Gi^2) + p^2 / h]
h = Sum i from A to G (Sum j from C to T (Xij^2 / (2 * Gi * Gj)))
p = p-distance
Xij = frequency of pair (i, j) in SEQ_1 and SEQ_2, with gaps removed
Gi = frequency of base i over SEQ_1 and SEQ_2

bioinf tajima-nei-distance [OPTIONS] SEQ_1 SEQ_2

Arguments

SEQ_1¶: Required argument

SEQ_2¶: Required argument

tamura-distance¶

Tamura distance between two sequences

Return the Tamura distance between SEQ_1 and SEQ_2, calculated as distance = -C log(1 - P/C - Q) - 0.5(1 - C)log(1 - 2Q), where:

P = transition frequency
Q = transversion frequency
C = GC1 + GC2 - 2 * GC1 * GC2
GC1 = GC-content of SEQ_1
GC2 = GC-content of SEQ_2

bioinf tamura-distance [OPTIONS] SEQ_1 SEQ_2

Arguments

SEQ_1¶: Required argument

SEQ_2¶: Required argument

prestools clustering¶

clustering¶

Data clustering utilities

clustering [OPTIONS] COMMAND [ARGS]...

find-n-clusters-elbow¶

Find the number of clusters using the elbow method

Find the suggested number of clusters for the given dataframe of correlations, using the elbow method.

clustering find-n-clusters-elbow [OPTIONS] DF

Options

-m, --method <method>¶

Method to be used to cluster the data [‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’] (default = ‘ward’)

Options: ward|single|complete|average|weighted|centroid|median

Arguments

DF¶: Required argument

prestools misc¶

misc¶

Miscellaneous utilities

misc [OPTIONS] COMMAND [ARGS]...

equal-files¶

Check whether two files are identical

First check whether FILE1 and FILE2 have the same size; if so, read them and compare their contents.

misc equal-files [OPTIONS] FILE1 FILE2

Arguments

FILE1¶: Required argument

FILE2¶: Required argument

prime-factors¶

Calculate the prime factors of a number

Calculate the prime factors of a given natural NUMBER. Note that 1 is not a prime number, so it will not be included.

misc prime-factors [OPTIONS] NUMBER

Arguments

NUMBER¶: Required argument

wordcount¶

Count occurrences of words in a sentence

Return the number of occurrences of each word in the given SENTENCE, in the form of a dictionary; it is also possible to directly return the number of occurrences of a specific WORD.

misc wordcount [OPTIONS] SENTENCE

Options

-w, --word <word>¶: Target word to count occurrences of

-i, --ignore_case¶: Ignore case in the given sentence (default: False)

Arguments

SENTENCE¶: Required argument