API
All functionality of prestools is available after importing the desired module into Python, for example import prestools.bioinf as pb.
Some functions are also available as Command Line Interface commands, but this should not be relied on.
Python module functions
prestools.bioinf

prestools.bioinf.aa_one_to_three(sequence: str) → str
Convert one-letter amino acid code to three-letter code.
 Parameters
sequence – sequence of amino acids in one-letter code
 Returns
sequence converted to three-letter code
 Return type
new_seq

prestools.bioinf.aa_three_to_one(sequence: str) → str
Convert three-letter amino acid code to one-letter code.
 Parameters
sequence – sequence of amino acids in three-letter code
 Returns
sequence converted to one-letter code
 Return type
new_seq

prestools.bioinf.hamming_distance(seq_1: str, seq_2: str, ignore_case: bool = False) → int
Calculate the Hamming distance between two sequences.
 Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
ignore_case – ignore case when comparing sequences (default: False)
 Returns
Hamming distance
 Return type
distance
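
As an illustration of what a Hamming distance computes, here is a minimal standalone sketch. It is not the prestools implementation: the name hamming_distance_sketch and the choice to reject sequences of unequal length are assumptions made for this example.

```python
def hamming_distance_sketch(seq_1: str, seq_2: str, ignore_case: bool = False) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_1) != len(seq_2):
        # Assumption: unequal lengths are an error; the library may behave differently.
        raise ValueError("sequences must have the same length")
    if ignore_case:
        seq_1, seq_2 = seq_1.upper(), seq_2.upper()
    return sum(a != b for a, b in zip(seq_1, seq_2))


print(hamming_distance_sketch("GATTACA", "GACTATA"))  # 2
print(hamming_distance_sketch("gattaca", "GATTACA", ignore_case=True))  # 0
```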

prestools.bioinf.jukes_cantor_distance(seq_1: str, seq_2: str) → float
Calculate the Jukes-Cantor distance between two sequences.
Return the Jukes-Cantor distance between seq_1 and seq_2, calculated as distance = -b log(1 - p/b) where b = 3/4 and p = p_distance.
 Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
 Returns
Jukes-Cantor distance
 Return type
distance
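
The formula above can be sketched in a few lines of standalone Python. This is an illustration only, not the library's code: the helper names are hypothetical, and treating every aligned position (including any gap characters) as comparable is an assumption.

```python
import math


def p_distance_sketch(seq_1: str, seq_2: str) -> float:
    """Uncorrected distance: the fraction of aligned positions that differ."""
    if len(seq_1) != len(seq_2) or not seq_1:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(seq_1, seq_2))
    return mismatches / len(seq_1)


def jukes_cantor_sketch(seq_1: str, seq_2: str) -> float:
    """distance = -b * log(1 - p / b), with b = 3/4."""
    b = 0.75
    p = p_distance_sketch(seq_1, seq_2)
    if p >= b:
        # The logarithm is undefined once p reaches the saturation value b.
        raise ValueError("Jukes-Cantor distance is undefined for p >= 3/4")
    return -b * math.log(1 - p / b)


print(jukes_cantor_sketch("AAAAAAAAAA", "AAAAAAAAAT"))
```

The corrected distance is always at least as large as the raw p-distance, because it accounts for multiple substitutions at the same site.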

prestools.bioinf.kimura_distance(seq_1: str, seq_2: str) → float
Calculate the Kimura 2-Parameter distance between two sequences.
Return the Kimura 2-Parameter distance between seq_1 and seq_2, calculated as distance = -0.5 log((1 - 2p - q) * sqrt(1 - 2q)) where p = transition frequency and q = transversion frequency.
 Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
 Returns
Kimura distance
 Return type
distance
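
To make the roles of p and q concrete, the following sketch counts transitions (A/G or C/T exchanges) and transversions (all other mismatches) and then applies the formula. It is a hypothetical standalone example, not the prestools code, and it assumes the inputs contain only the characters A, C, G and T.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_sketch(seq_1: str, seq_2: str) -> float:
    """distance = -0.5 * log((1 - 2p - q) * sqrt(1 - 2q))."""
    if len(seq_1) != len(seq_2) or not seq_1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq_1.upper(), seq_2.upper()):
        if a == b:
            continue
        # A mismatch within purines or within pyrimidines is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq_1)
    q = transversions / len(seq_1)
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        raise ValueError("sequences are too divergent for this model")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition (A to G) and one transversion (A to C) over ten sites.
print(kimura_sketch("AAAAAAAAAA", "GAAAAAAAAC"))
```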

prestools.bioinf.mutate_sequence(sequence: str, mutations: int = 1, alphabet: str = 'nt') → str
Mutate a sequence introducing a given number of mutations.
Introduce a specific number of mutations into the given sequence.
 Parameters
sequence – input sequence to mutate
mutations – number of mutations to introduce (default: 1)
alphabet – character alphabet to use (‘nt’, ‘aa’) (default: ‘nt’)
 Returns
mutated sequence
 Return type
sequence

prestools.bioinf.nt_frequency(sequence: str) → Dict[str, float]
Calculate nucleotide frequencies.
Return a dictionary with nucleotide frequencies from the given sequence.
 Parameters
sequence – input nucleotide sequence
 Returns
dictionary of nucleotide frequencies
 Return type
freqs
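
A possible shape for such a frequency dictionary is sketched below. This is not the library's implementation: reporting frequencies as fractions of the sequence length, and restricting the keys to A, C, G and T, are both assumptions for illustration.

```python
from collections import Counter
from typing import Dict


def nt_frequency_sketch(sequence: str) -> Dict[str, float]:
    """Return the fraction of the sequence made up by each of A, C, G, T."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence.upper())
    # Counter returns 0 for nucleotides that never occur.
    return {nt: counts[nt] / len(sequence) for nt in "ACGT"}


print(nt_frequency_sketch("AACG"))
```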

prestools.bioinf.p_distance(seq_1: str, seq_2: str) → float
Calculate the pairwise distance between two sequences.
Return the uncorrected distance between seq_1 and seq_2.
 Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
 Returns
pairwise distance
 Return type
distance

prestools.bioinf.quantile_norm(x: numpy.ndarray, to_log: bool = False) → numpy.ndarray
Normalize the columns of x to each have the same distribution.
Given an expression matrix (microarray data, read counts, etc.) of M genes by N samples, quantile normalization ensures all samples have the same spread of data (by construction).
The data across each row are averaged to obtain an average column. Each column quantile is replaced with the corresponding quantile of the average column.
 Parameters
x – array of input data, of shape (N_genes, N_samples)
to_log – log-transform the data before normalising (default: False)
 Returns
array of normalised data, of shape (N_genes, N_samples)
 Return type
xn
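
The procedure described above (sort each column, average across rows to get a reference column, then substitute by rank) can be sketched with NumPy as follows. This is an illustrative reimplementation under assumptions, not the prestools source: the function name is hypothetical, ties are broken by position rather than averaged, and the log transform is taken to be log(x + 1).

```python
import numpy as np


def quantile_norm_sketch(x: np.ndarray, to_log: bool = False) -> np.ndarray:
    """Give every column of x the same empirical distribution."""
    x = np.asarray(x, dtype=float)
    if to_log:
        # Assumption: log(x + 1) so that zero counts stay finite.
        x = np.log(x + 1)
    # Reference distribution: mean of the k-th smallest value across columns.
    reference = np.mean(np.sort(x, axis=0), axis=1)
    # Rank of each value within its own column (0 = smallest).
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # Replace each value with the reference value of the same rank.
    return reference[ranks]


data = np.array([[5, 4], [2, 1], [3, 6]])
print(quantile_norm_sketch(data))
```

After normalisation, sorting any column yields the same reference values, which is what "the same distribution by construction" means in practice.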

prestools.bioinf.random_sequence(length: Union[int, str], alphabet: str = 'nt') → str
Create a random sequence of the given length.
Create a random sequence of the given length using the specified alphabet (nucleotides or amino acids).
 Parameters
length – desired length of the random sequence
alphabet – character alphabet to use (‘nt’, ‘aa’) (default: ‘nt’)
 Returns
new random sequence
 Return type
sequence

prestools.bioinf.reverse_complement(sequence: str, conversion: str = 'reverse_complement') → str
Convert a nucleotide sequence into its reverse complement.
Convert a nucleotide sequence into its reverse, complement or reverse complement.
 Parameters
sequence – nucleotide sequence to be converted
conversion – type of conversion to perform (‘r’|‘reverse’, ‘c’|‘complement’, ‘rc’|‘reverse_complement’) (default: ‘rc’|‘reverse_complement’)
 Returns
converted sequence

prestools.bioinf.rpkm(counts: numpy.ndarray, lengths: numpy.ndarray) → numpy.ndarray
Calculate reads per kilobase transcript per million reads.
RPKM = (10^9 * C) / (N * L)
where:
C = number of reads mapped to a gene
N = total mapped reads in the experiment
L = exon length in base pairs for a gene
 Parameters
counts – count data where columns are individual samples and rows are genes, of shape (N_genes, N_samples)
lengths – gene lengths in base pairs in the same order as the rows in counts, of shape (N_genes, )
 Returns
RPKM normalized counts matrix, of shape (N_genes, N_samples)
 Return type
normed
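
To show how the formula applies per gene and per sample, here is a dependency-free sketch that uses nested lists in place of NumPy arrays. The function name and the list-based interface are assumptions made for this example; the library itself operates on numpy.ndarray inputs.

```python
from typing import List, Sequence


def rpkm_sketch(counts: Sequence[Sequence[float]],
                lengths: Sequence[float]) -> List[List[float]]:
    """RPKM = (10^9 * C) / (N * L), applied to a genes-by-samples table."""
    n_samples = len(counts[0])
    # N: total mapped reads, computed separately for each sample (column).
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [1e9 * c / (totals[j] * length) for j, c in enumerate(row)]
        for row, length in zip(counts, lengths)
    ]


# Two genes (rows) of 1000 and 2000 bp, measured in two samples (columns).
print(rpkm_sketch([[10, 20], [30, 20]], [1000, 2000]))
```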

prestools.bioinf.shuffle_sequence(sequence: str) → str
Shuffle the given sequence.
Randomly shuffle a sequence, maintaining the same composition.
 Parameters
sequence – input sequence to shuffle
 Returns
shuffled sequence
 Return type
tmp_seq

prestools.bioinf.tajima_nei_distance(seq_1: str, seq_2: str) → float
Calculate the Tajima-Nei distance between two sequences.
Return the Tajima-Nei distance between seq_1 and seq_2, calculated as distance = -b log(1 - p / b), where:
b = 0.5 * [1 - Sum i from A to T(Gi^2+p^2/h)]
h = Sum i from A to G(Sum j from C to T (Xij^2/2*Gi*Gj))
p = p-distance
Xij = frequency of pair (i,j) in seq1 and seq2, with gaps removed
Gi = frequency of base i over seq1 and seq2
 Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
 Returns
Tajima-Nei distance
 Return type
distance

prestools.bioinf.tamura_distance(seq_1: str, seq_2: str) → float
Calculate the Tamura distance between two sequences.
Return the Tamura distance between seq_1 and seq_2, calculated as distance = -C log(1 - P/C - Q) - 0.5(1 - C)log(1 - 2Q), where:
P = transition frequency
Q = transversion frequency
C = GC1 + GC2 - 2 * GC1 * GC2
GC1 = GC-content of seq_1
GC2 = GC-content of seq_2
 Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
 Returns
Tamura distance
 Return type
distance
prestools.clustering

prestools.clustering.find_n_clusters_elbow(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], plot: bool = False, method: str = 'ward') → Union[int, None, ValueError]
Find the suggested number of clusters using the elbow method.
Find the suggested number of clusters for the given dataframe of correlations, using the elbow method.
 Parameters
df – input dataframe of correlations
plot – plot the resulting elbow plot (default: False)
method – method to use to cluster the data (‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’) (default: ‘ward’)
 Returns
number of clusters found
 Return type
n_clusters

prestools.clustering.hierarchical_clustering(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], method: str = 'ward') → Union[prestools.classes.HierCluster, None, ValueError]
Hierarchical clustering of a dataframe.
Return clustering created using scipy from a given dataframe of correlations, using the HierCluster class available in prestools.classes.
 Parameters
df – input dataframe of correlations
method – method to use to cluster the data (‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’) (default: ‘ward’)
 Returns
instance of prestools.classes.HierCluster()
 Return type
cl
prestools.graph

prestools.graph.flatten_image(img: numpy.ndarray, scale: bool = False) → numpy.ndarray
Convert an image array to a single-dimension vector.
 Parameters
img – input image array of shape (l, h, d = 3)
scale – scale resulting vector dividing its values by 255 (default: False)
 Returns
reshaped vector of shape (l * h * d, 1)
 Return type
v
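
The reshape described by these parameters can be sketched with NumPy as below. This is a hypothetical standalone version rather than the library's code; the default row-major flattening order is an assumption.

```python
import numpy as np


def flatten_image_sketch(img: np.ndarray, scale: bool = False) -> np.ndarray:
    """Reshape an (l, h, d) image array into a column vector of shape (l * h * d, 1)."""
    v = np.asarray(img).reshape(-1, 1)
    if scale:
        # Map 8-bit pixel intensities from [0, 255] to [0, 1].
        v = v / 255
    return v


image = np.arange(24).reshape(2, 4, 3)
print(flatten_image_sketch(image).shape)
```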

prestools.graph.plot_confusion_matrix(cm: numpy.ndarray, class_names: List[str], title: str = 'Confusion Matrix', cmap: str = 'Reds', normalize: bool = False, save: Union[bool, str] = False)
Create a plot from a confusion matrix array.
 Parameters
cm – input confusion matrix array
class_names – class names to use
title – title for resulting plot (default: ‘Confusion Matrix’)
cmap – colormap to use (default: ‘Reds’)
normalize – use class ratios instead of raw numbers (default: False)
save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)

prestools.graph.plot_dendrogram(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], cut_off: Union[bool, float] = False, title: str = 'Dendrogram', save: Union[bool, str] = False, method: str = 'ward')
Plot a dendrogram from a dataframe.
Create (and optionally save) a dendrogram plot starting from a given dataframe of correlations. It is also possible to add a cut-off line at a given distance to separate clusters.
 Parameters
df – input dataframe of correlations
cut_off – if not False, a vertical line will be added to better identify clusters (default: False)
title – title for resulting plot (default: ‘Dendrogram’)
save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)
method – method to use to cluster the data (default: ‘ward’)

prestools.graph.plot_heatmap_dendrogram(df: pandas.core.frame.DataFrame, cmap: str = 'RdBu_r', title: str = 'Cluster Heatmap', save: Union[bool, str] = False, method: str = 'ward')
Plot a heatmap with hierarchical clustering of a dataframe.
Create (and optionally save) a heatmap with hierarchical clustering created using Seaborn, starting from a given dataframe of correlations.
 Parameters
df – input dataframe of correlations
cmap – colormap to use (default: ‘RdBu_r’)
title – title for resulting plot (default: ‘Cluster Heatmap’)
save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)
method – method to use to cluster the data (default: ‘ward’)
prestools.misc

prestools.misc.apply_parallel(df: pandas.core.frame.DataFrame, function: Callable, cores: int = 4) → pandas.core.frame.DataFrame
Apply a function to a dataframe in parallel.
Apply the given function to the dataframe, using the given number of cores for computation. The dataframe is split into as many parts as there are cores, and the function is applied to each part separately; finally, the dataframe is reconstructed and returned.
 Parameters
df – input dataframe
function – function to apply
cores – number of cores to use (default: 4)
 Returns
resulting dataframe
 Return type
df

prestools.misc.benchmark(function: Callable) → Callable
Benchmark a given function.
Decorator to run the given function and return the function name and the amount of time spent in executing it.
 Parameters
function – function to benchmark
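
A timing decorator of this kind can be sketched as follows. This is an illustration, not the prestools implementation: the name benchmark_sketch is hypothetical, and printing the function name and elapsed time while passing the wrapped function's result through unchanged is an assumption about the behaviour.

```python
import functools
import time
from typing import Callable


def benchmark_sketch(function: Callable) -> Callable:
    """Decorator that reports how long each call to `function` takes."""
    @functools.wraps(function)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = function(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{function.__name__}: {elapsed:.6f} s")
        return result
    return wrapper


@benchmark_sketch
def sum_of_squares(n: int) -> int:
    return sum(i * i for i in range(n))


sum_of_squares(100_000)
```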

prestools.misc.equal_files(file1: str, file2: str) → bool
Check whether two files are identical.
First check whether the files have the same size; if so, read them and check their content for equality.
 Parameters
file1 – first file to compare
file2 – second file to compare
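
The size-then-content strategy described above might look like the following standalone sketch. It is not the library's code: the function name and the chunk_size parameter are hypothetical, and reading in fixed-size binary chunks is a design choice made here to avoid loading large files fully into memory.

```python
import os


def equal_files_sketch(file1: str, file2: str, chunk_size: int = 65536) -> bool:
    """Compare two files by size first, then by content."""
    # Cheap check: files of different size cannot be identical.
    if os.path.getsize(file1) != os.path.getsize(file2):
        return False
    with open(file1, "rb") as handle_1, open(file2, "rb") as handle_2:
        while True:
            chunk_1 = handle_1.read(chunk_size)
            chunk_2 = handle_2.read(chunk_size)
            if chunk_1 != chunk_2:
                return False
            if not chunk_1:
                # Both files ended at the same point with no difference found.
                return True
```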

prestools.misc.filter_type(input_list: List[Any], target_type: Type) → List[Any]
Only keep elements of a given type from a list of elements.
Traverse a list and return a new list with only elements of the original list belonging to a given type.
 Parameters
input_list – input list to filter
target_type – desired type to keep
 Returns
filtered list
 Return type
filtered

prestools.misc.flatten(iterable: Iterable, drop_null: bool = False) → List[Any]
Flatten out a nested iterable.
Flatten a nested iterable, even with multiple nesting levels and different data types. It is also possible to drop null values (None) from the resulting list.
 Parameters
iterable – nested iterable to flatten
drop_null – filter out None from the flattened list (default: False)
 Returns
flat list
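
Flattening across arbitrary nesting levels is naturally recursive, as the sketch below shows. It is a hypothetical standalone version, not the prestools source; treating strings and bytes as atomic values rather than as iterables of characters is an assumption.

```python
from collections.abc import Iterable
from typing import Any, List


def flatten_sketch(iterable: Iterable, drop_null: bool = False) -> List[Any]:
    """Flatten a nested iterable into a single list."""
    flat: List[Any] = []
    for item in iterable:
        # Recurse into containers, but keep strings and bytes whole.
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            flat.extend(flatten_sketch(item, drop_null))
        elif item is None and drop_null:
            continue
        else:
            flat.append(item)
    return flat


print(flatten_sketch([1, [2, [3, None]], "ab", (4,)], drop_null=True))
```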

prestools.misc.invert_dict(input_dict: dict, sort_keys: bool = False) → dict
Create a new dictionary swapping keys and values.
Invert a given dictionary, creating a new dictionary in which each key is a value of the original dictionary and its value is the key it was associated with in the original, e.g. invert_dict({1: [“A”, “E”], 2: [“D”, “G”]}) = {“A”: 1, “E”: 1, “D”: 2, “G”: 2}. The keys of the inverted dictionary can optionally be sorted alphabetically, which is mostly useful when printing the results.
 Parameters
input_dict – original dictionary to be inverted
sort_keys – sort the keys in the inverted dictionary in alphabetical order (default: False)
 Returns
inverted dictionary
 Return type
new_dict
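
The inversion shown in the example above can be reproduced with this small sketch. It is illustrative only and not the library's implementation: the function name is hypothetical, and handling a non-list value as a single item is an assumption.

```python
def invert_dict_sketch(input_dict: dict, sort_keys: bool = False) -> dict:
    """Map every value (or list element) of input_dict back to its key."""
    new_dict = {}
    for key, values in input_dict.items():
        # Assumption: a scalar value is treated as a one-element collection.
        items = values if isinstance(values, (list, tuple, set)) else [values]
        for item in items:
            new_dict[item] = key
    if sort_keys:
        new_dict = dict(sorted(new_dict.items()))
    return new_dict


print(invert_dict_sketch({1: ["A", "E"], 2: ["D", "G"]}, sort_keys=True))
```

Because Python dictionaries preserve insertion order, rebuilding the dictionary from sorted items is enough to make the sorted order visible when printing.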

prestools.misc.prime_factors(number: int) → List[int]
Calculate the prime factors of a number.
Calculate the prime factors of a given natural number. Note that 1 is not a prime number, so it will not be included.
 Parameters
number – input natural number
 Returns
list of prime factors
 Return type
factors
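
Trial division is one simple way to obtain such a list, sketched below. This is not necessarily the algorithm prestools uses; the function name is hypothetical, and listing repeated factors with their multiplicity is an assumption.

```python
from typing import List


def prime_factors_sketch(number: int) -> List[int]:
    """Return the prime factors of a natural number by trial division."""
    if number < 1:
        raise ValueError("number must be a natural number")
    factors: List[int] = []
    divisor = 2
    # Any composite number has a prime factor no larger than its square root.
    while divisor * divisor <= number:
        while number % divisor == 0:
            factors.append(divisor)
            number //= divisor
        divisor += 1
    if number > 1:
        # Whatever remains is itself prime.
        factors.append(number)
    return factors


print(prime_factors_sketch(360))  # [2, 2, 2, 3, 3, 5]
```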

prestools.misc.wordcount(sentence: str, word: Union[bool, str] = False, ignore_case: bool = False) → Union[dict, int]
Count occurrences of words in a sentence.
Return the number of occurrences of each word in the given sentence, in the form of a dictionary; it is also possible to directly return the number of occurrences of a specific word.
 Parameters
sentence – input sentence to count words from
word – target word to count occurrences of
ignore_case – ignore case in the given sentence (default: False)
 Returns
dictionary of word counts
 Return type
word_dict
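
The two return modes (a full dictionary, or a single count when word is given) can be sketched as follows. This is a hypothetical standalone version: splitting on whitespace without stripping punctuation is an assumption, and the library may tokenise differently.

```python
from collections import Counter
from typing import Union


def wordcount_sketch(sentence: str, word: Union[bool, str] = False,
                     ignore_case: bool = False) -> Union[dict, int]:
    """Count words in a sentence, or occurrences of one target word."""
    if ignore_case:
        sentence = sentence.lower()
    counts = Counter(sentence.split())
    if word:
        target = word.lower() if ignore_case else word
        return counts[target]
    return dict(counts)


print(wordcount_sketch("the cat and The dog", word="the", ignore_case=True))  # 2
```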
Command Line Interface
prestools bioinf
bioinf
Bioinformatics utilities
bioinf [OPTIONS] COMMAND [ARGS]...
hamming-distance
Hamming distance between two sequences
Calculate the Hamming distance between SEQ_1 and SEQ_2.
bioinf hamming-distance [OPTIONS] SEQ_1 SEQ_2
Options

-i, --ignore_case
Ignore case when comparing sequences (default: False)
Arguments

SEQ_1
Required argument

SEQ_2
Required argument
jukes-cantor-distance
Jukes-Cantor distance between two sequences
Return the Jukes-Cantor distance between SEQ_1 and SEQ_2, calculated as distance = -b log(1 - p/b) where b = 3/4 and p = p_distance.
bioinf jukes-cantor-distance [OPTIONS] SEQ_1 SEQ_2
Arguments

SEQ_1
Required argument

SEQ_2
Required argument
kimura-distance
Kimura 2-Parameter distance between two sequences
Return the Kimura 2-Parameter distance between SEQ_1 and SEQ_2, calculated as distance = -0.5 log((1 - 2p - q) * sqrt(1 - 2q)) where p = transition frequency and q = transversion frequency.
bioinf kimura-distance [OPTIONS] SEQ_1 SEQ_2
Arguments

SEQ_1
Required argument

SEQ_2
Required argument
p-distance
Pairwise distance between two sequences
Return the uncorrected distance between SEQ_1 and SEQ_2.
bioinf p-distance [OPTIONS] SEQ_1 SEQ_2
Arguments

SEQ_1
Required argument

SEQ_2
Required argument
random-sequence
Create a random sequence of the given length
Create a random sequence of the given LENGTH using the specified ALPHABET (nucleotides or amino acids).
bioinf random-sequence [OPTIONS] LENGTH
Options

-a, --alphabet <alphabet>
Character alphabet to use to create the sequence (‘nt’, ‘aa’) (default: ‘nt’)
 Options
nt | aa
Arguments

LENGTH
Required argument
reverse-complement
Convert a nucleotide sequence into its reverse complement
Convert a nucleotide SEQUENCE into its reverse, complement or reverse complement.
bioinf reverse-complement [OPTIONS] SEQUENCE
Options

-c, --conversion <conversion>
Type of conversion to perform (‘r’|‘reverse’, ‘c’|‘complement’, ‘rc’|‘reverse_complement’) (default: ‘rc’|‘reverse_complement’)
 Options
reverse | complement | reverse_complement | r | c | rc
Arguments

SEQUENCE
Required argument
shuffle-sequence
Shuffle the given sequence
Randomly shuffle a SEQUENCE, maintaining the same nucleotide composition.
bioinf shuffle-sequence [OPTIONS] SEQUENCE
Arguments

SEQUENCE
Required argument
tajima-nei-distance
Tajima-Nei distance between two sequences
Return the Tajima-Nei distance between SEQ_1 and SEQ_2, calculated as distance = -b log(1 - p / b), where:
b = 0.5 * [1 - Sum i from A to T(Gi^2+p^2/h)]
h = Sum i from A to G(Sum j from C to T (Xij^2/2*Gi*Gj))
p = p-distance
Xij = frequency of pair (i,j) in SEQ_1 and SEQ_2, with gaps removed
Gi = frequency of base i over SEQ_1 and SEQ_2
bioinf tajima-nei-distance [OPTIONS] SEQ_1 SEQ_2
Arguments

SEQ_1
Required argument

SEQ_2
Required argument
tamura-distance
Tamura distance between two sequences
Return the Tamura distance between SEQ_1 and SEQ_2, calculated as distance = -C log(1 - P/C - Q) - 0.5(1 - C)log(1 - 2Q), where:
P = transition frequency
Q = transversion frequency
C = GC1 + GC2 - 2 * GC1 * GC2
GC1 = GC-content of SEQ_1
GC2 = GC-content of SEQ_2
bioinf tamura-distance [OPTIONS] SEQ_1 SEQ_2
Arguments

SEQ_1
Required argument

SEQ_2
Required argument
prestools clustering
clustering
Data clustering utilities
clustering [OPTIONS] COMMAND [ARGS]...
find-n-clusters-elbow
Find the number of clusters using the elbow method
Find the suggested number of clusters for the given dataframe of correlations, using the elbow method.
clustering find-n-clusters-elbow [OPTIONS] DF
Options

-m, --method <method>
Method to be used to cluster the data [‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’] (default = ‘ward’)
 Options
ward | single | complete | average | weighted | centroid | median
Arguments

DF
Required argument
prestools misc
misc
Miscellaneous utilities
misc [OPTIONS] COMMAND [ARGS]...
equal-files
Check whether two files are identical
First check whether FILE1 and FILE2 have the same size; if so, read them and check their content for equality.
misc equal-files [OPTIONS] FILE1 FILE2
Arguments

FILE1
Required argument

FILE2
Required argument
prime-factors
Calculate the prime factors of a number
Calculate the prime factors of a given natural NUMBER. Note that 1 is not a prime number, so it will not be included.
misc prime-factors [OPTIONS] NUMBER
Arguments

NUMBER
Required argument
wordcount
Count occurrences of words in a sentence
Return the number of occurrences of each word in the given SENTENCE, in the form of a dictionary; it is also possible to directly return the number of occurrences of a specific WORD.
misc wordcount [OPTIONS] SENTENCE
Options

-w, --word <word>
Target word to count occurrences of

-i, --ignore_case
Ignore case in the given sentence (default: False)
Arguments

SENTENCE
Required argument