API¶
All functionality in prestools is available once the desired module has been imported into Python, for example import prestools.bioinf as pb (and similarly for the other modules). Some functions are also available as Command-Line Interface commands, but only a subset is exposed there, so the CLI should not be relied on for full coverage.
Python module functions¶
prestools.bioinf¶
prestools.bioinf.aa_one_to_three(sequence: str) → str
Convert one-letter amino acid code to three-letter code.
- Parameters
sequence – sequence of amino acids in one-letter code
- Returns
sequence converted to three-letter code
- Return type
new_seq
prestools.bioinf.aa_three_to_one(sequence: str) → str
Convert three-letter amino acid code to one-letter code.
- Parameters
sequence – sequence of amino acids in three-letter code
- Returns
sequence converted to one-letter code
- Return type
new_seq
prestools.bioinf.hamming_distance(seq_1: str, seq_2: str, ignore_case: bool = False) → int
Calculate the Hamming distance between two sequences.
- Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
ignore_case – ignore case when comparing sequences (default: False)
- Returns
Hamming distance
- Return type
distance
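The computation described above can be sketched in a few lines. This is an illustration, not the prestools implementation: the name hamming_distance_sketch and the ValueError raised for sequences of unequal length are assumptions.

```python
def hamming_distance_sketch(seq_1: str, seq_2: str, ignore_case: bool = False) -> int:
    """Count the positions at which two equal-length sequences differ."""
    # Assumption: unequal lengths are treated as an error.
    if len(seq_1) != len(seq_2):
        raise ValueError("sequences must have the same length")
    if ignore_case:
        seq_1, seq_2 = seq_1.upper(), seq_2.upper()
    return sum(a != b for a, b in zip(seq_1, seq_2))


print(hamming_distance_sketch("GATTACA", "GACTATA"))
print(hamming_distance_sketch("acgt", "ACGT", ignore_case=True))
```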
prestools.bioinf.jukes_cantor_distance(seq_1: str, seq_2: str) → float
Calculate the Jukes-Cantor distance between two sequences.
Return the Jukes-Cantor distance between seq_1 and seq_2, calculated as distance = -b log(1 - p/b) where b = 3/4 and p = p_distance.
- Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
- Returns
Jukes-Cantor distance
- Return type
distance
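The Jukes-Cantor formula above can be sketched as follows, assuming equal-length sequences without gaps and a natural logarithm. The helper names are hypothetical, and the library may treat gaps or unequal lengths differently.

```python
import math


def p_distance_sketch(seq_1: str, seq_2: str) -> float:
    """Uncorrected distance: the fraction of positions that differ."""
    # Assumption: both sequences are non-empty and have the same length.
    if not seq_1 or len(seq_1) != len(seq_2):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a != b for a, b in zip(seq_1, seq_2)) / len(seq_1)


def jukes_cantor_sketch(seq_1: str, seq_2: str) -> float:
    """distance = -b log(1 - p/b), with b = 3/4 and p = p-distance."""
    b = 0.75
    p = p_distance_sketch(seq_1, seq_2)
    # math.log raises ValueError when p >= b, where the distance is undefined.
    return -b * math.log(1 - p / b)


print(jukes_cantor_sketch("ACGT", "ACGA"))
```

The correction grows faster than p itself, which reflects the multiple substitutions that may have occurred at the same site.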
prestools.bioinf.kimura_distance(seq_1: str, seq_2: str) → float
Calculate the Kimura 2-Parameter distance between two sequences.
Return the Kimura 2-Parameter distance between seq_1 and seq_2, calculated as distance = -0.5 log((1 - 2p -q) * sqrt( 1 - 2q )) where p = transition frequency and q = transversion frequency.
- Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
- Returns
Kimura distance
- Return type
distance
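A minimal sketch of the Kimura 2-Parameter formula above, assuming equal-length sequences that contain only A, C, G and T. The function name is hypothetical and this is not necessarily how prestools counts transitions and transversions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_sketch(seq_1: str, seq_2: str) -> float:
    """distance = -0.5 log((1 - 2p - q) * sqrt(1 - 2q))."""
    if not seq_1 or len(seq_1) != len(seq_2):
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq_1.upper(), seq_2.upper()):
        if a == b:
            continue
        # A<->G and C<->T are transitions; any other mismatch is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq_1)
    q = transversions / len(seq_1)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


print(kimura_sketch("AAAAAAAAAA", "GAAAAAAAAC"))
```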
prestools.bioinf.mutate_sequence(sequence: str, mutations: int = 1, alphabet: str = 'nt') → str
Mutate a sequence introducing a given number of mutations.
Introduce a specific number of mutations into the given sequence.
- Parameters
sequence – input sequence to mutate
mutations – number of mutations to introduce (default: 1)
alphabet – character alphabet to use (‘nt’, ‘aa’) (default: ‘nt’)
- Returns
mutated sequence
- Return type
sequence
prestools.bioinf.nt_frequency(sequence: str) → Dict[str, float]
Calculate nucleotide frequencies.
Return a dictionary with nucleotide frequencies from the given sequence.
- Parameters
sequence – input nucleotide sequence
- Returns
dictionary of nucleotide frequencies
- Return type
freqs
prestools.bioinf.p_distance(seq_1: str, seq_2: str) → float
Calculate the pairwise distance between two sequences.
Return the uncorrected distance between seq_1 and seq_2.
- Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
- Returns
pairwise distance
- Return type
distance
prestools.bioinf.quantile_norm(x: numpy.ndarray, to_log: bool = False) → numpy.ndarray
Normalize the columns of x so that each has the same distribution.
Given an expression matrix (microarray data, read counts, etc.) of M genes by N samples, quantile normalization ensures all samples have the same spread of data (by construction).
Each column is sorted, and the sorted values are averaged across each row to obtain an average column. Each column quantile is then replaced with the corresponding quantile of the average column.
- Parameters
x – array of input data, of shape (N_genes, N_samples)
to_log – log-transform the data before normalising (default: False)
- Returns
array of normalised data, of shape (N_genes, N_samples)
- Return type
xn
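The procedure can be illustrated with a dependency-free sketch that works on a list of rows instead of a numpy array. The function name is hypothetical, the to_log option is left out, and ties are broken by original row order, which is an assumption rather than documented behaviour.

```python
def quantile_norm_sketch(x):
    """Quantile-normalise a matrix given as a list of rows (genes x samples)."""
    n_genes, n_samples = len(x), len(x[0])
    columns = [[row[j] for row in x] for j in range(n_samples)]

    # Average column: mean of the i-th smallest value of every sample.
    sorted_cols = [sorted(col) for col in columns]
    mean_quantiles = [
        sum(col[i] for col in sorted_cols) / n_samples for i in range(n_genes)
    ]

    # Replace each value with the average-column value of the same rank.
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = mean_quantiles[rank]
    return result


for row in quantile_norm_sketch([[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]):
    print(row)
```

After normalisation every column contains exactly the same set of values, only in a different order.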
prestools.bioinf.random_sequence(length: Union[int, str], alphabet: str = 'nt') → str
Create a random sequence of the given length.
Create a random sequence of the given length using the specified alphabet (nucleotides or amino acids).
- Parameters
length – desired length of the random sequence
alphabet – character alphabet to use (‘nt’, ‘aa’) (default: ‘nt’)
- Returns
new random sequence
- Return type
sequence
prestools.bioinf.reverse_complement(sequence: str, conversion: str = 'reverse_complement') → str
Convert a nucleotide sequence into its reverse complement.
Convert a nucleotide sequence into its reverse, complement or reverse complement.
- Parameters
sequence – nucleotide sequence to be converted
conversion – type of conversion to perform (‘r’|‘reverse’, ‘c’|‘complement’, ‘rc’|‘reverse_complement’) (default: ‘rc’|‘reverse_complement’)
- Returns
converted sequence
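The three conversions can be sketched as below, assuming a DNA alphabet (A, C, G, T) with case preserved. The function name and the ValueError for unknown conversion values are assumptions, not the library's documented behaviour.

```python
# Translation table for DNA bases; other characters are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(sequence: str, conversion: str = "reverse_complement") -> str:
    """Return the reverse, complement or reverse complement of a DNA sequence."""
    if conversion in ("r", "reverse"):
        return sequence[::-1]
    if conversion in ("c", "complement"):
        return sequence.translate(_COMPLEMENT)
    if conversion in ("rc", "reverse_complement"):
        return sequence.translate(_COMPLEMENT)[::-1]
    raise ValueError(f"unknown conversion: {conversion!r}")


print(reverse_complement_sketch("ATGC"))
print(reverse_complement_sketch("ATGC", conversion="c"))
```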
prestools.bioinf.rpkm(counts: numpy.ndarray, lengths: numpy.ndarray) → numpy.ndarray
Calculate reads per kilobase transcript per million reads.
RPKM = (10^9 * C) / (N * L)
where:
C = number of reads mapped to a gene
N = total mapped reads in the experiment
L = exon length in base pairs for a gene
- Parameters
counts – count data where columns are individual samples and rows are genes, of shape (N_genes, N_samples)
lengths – gene lengths in base pairs in the same order as the rows in counts, of shape (N_genes, )
- Returns
RPKM normalized counts matrix, of shape (N_genes, N_samples)
- Return type
normed
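The RPKM formula above can be worked through with a small dependency-free sketch. It uses nested lists rather than numpy arrays, takes N as the per-sample column total, and uses a hypothetical function name, so it shows the arithmetic only.

```python
def rpkm_sketch(counts, lengths):
    """RPKM = (10^9 * C) / (N * L) for a genes x samples list of counts."""
    n_samples = len(counts[0])
    # N: total mapped reads in each sample (column sums).
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [1e9 * c / (totals[j] * length) for j, c in enumerate(row)]
        for row, length in zip(counts, lengths)
    ]


# Two genes (1 kb and 2 kb long) measured in two samples.
print(rpkm_sketch([[10, 20], [30, 60]], [1000, 2000]))
```

The second sample has twice the reads of the first in every gene, so both samples end up with the same normalised values.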
prestools.bioinf.shuffle_sequence(sequence: str) → str
Shuffle the given sequence.
Randomly shuffle a sequence, maintaining the same composition.
- Parameters
sequence – input sequence to shuffle
- Returns
shuffled sequence
- Return type
tmp_seq
prestools.bioinf.tajima_nei_distance(seq_1: str, seq_2: str) → float
Calculate the Tajima-Nei distance between two sequences.
Return the Tajima-Nei distance between seq_1 and seq_2, calculated as distance = -b log(1 - p / b), where:
b = 0.5 * [1 - Sum i from A to T (Gi^2) + p^2 / h]
h = Sum i from A to G (Sum j from C to T (Xij^2 / (2 * Gi * Gj)))
p = p-distance
Xij = frequency of pair (i, j) in seq_1 and seq_2, with gaps removed
Gi = frequency of base i over seq_1 and seq_2
- Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
- Returns
Tajima-Nei distance
- Return type
distance
prestools.bioinf.tamura_distance(seq_1: str, seq_2: str) → float
Calculate the Tamura distance between two sequences.
Return the Tamura distance between seq_1 and seq_2, calculated as distance = -C log(1 - P/C - Q) - 0.5(1 - C)log(1 - 2Q), where:
P = transition frequency
Q = transversion frequency
C = GC1 + GC2 - 2 * GC1 * GC2
GC1 = GC-content of seq_1
GC2 = GC-content of seq_2
- Parameters
seq_1 – first sequence to compare
seq_2 – second sequence to compare
- Returns
Tamura distance
- Return type
distance
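A sketch of the Tamura formula above, assuming equal-length sequences made only of A, C, G and T and a natural logarithm. The helper names are hypothetical and the library's handling of gaps or ambiguous bases is not covered.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def gc_content_sketch(sequence: str) -> float:
    """Fraction of G and C bases in the sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def tamura_sketch(seq_1: str, seq_2: str) -> float:
    """distance = -C log(1 - P/C - Q) - 0.5 (1 - C) log(1 - 2Q)."""
    if not seq_1 or len(seq_1) != len(seq_2):
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq_1.upper(), seq_2.upper()):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq_1)
    q = transversions / len(seq_1)
    gc1, gc2 = gc_content_sketch(seq_1), gc_content_sketch(seq_2)
    c = gc1 + gc2 - 2 * gc1 * gc2
    # c == 0 (e.g. no G or C in either sequence) makes the formula undefined.
    if c == 0:
        raise ValueError("distance undefined when C = 0")
    return -c * math.log(1 - p / c - q) - 0.5 * (1 - c) * math.log(1 - 2 * q)


print(tamura_sketch("AAGGCCTTAA", "AGGGCCTTAC"))
```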
prestools.clustering¶
prestools.clustering.find_n_clusters_elbow(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], plot: bool = False, method: str = 'ward') → Union[int, None, ValueError]
Find the suggested number of clusters using the elbow method.
Find the suggested number of clusters for the given dataframe of correlations, using the elbow method.
- Parameters
df – input dataframe of correlations
plot – plot the resulting elbow plot (default: False)
method – method to use to cluster the data (‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’) (default: ‘ward’)
- Returns
number of clusters found
- Return type
n_clusters
prestools.clustering.hierarchical_clustering(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], method: str = 'ward') → Union[prestools.classes.HierCluster, None, ValueError]
Hierarchical clustering of a dataframe.
Return clustering created using scipy from a given dataframe of correlations, using the HierCluster class available in prestools.classes.
- Parameters
df – input dataframe of correlations
method – method to use to cluster the data (‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’) (default: ‘ward’)
- Returns
instance of prestools.classes.HierCluster()
- Return type
cl
prestools.graph¶
prestools.graph.flatten_image(img: numpy.ndarray, scale: bool = False) → numpy.ndarray
Convert an image array to a single-dimension vector.
- Parameters
img – input image array of shape (l, h, d = 3)
scale – scale resulting vector dividing its values by 255 (default: False)
- Returns
reshaped vector of shape (l * h * d, 1)
- Return type
v
prestools.graph.plot_confusion_matrix(cm: numpy.ndarray, class_names: List[str], title: str = 'Confusion Matrix', cmap: str = 'Reds', normalize: bool = False, save: Union[bool, str] = False)
Create a plot from a confusion matrix array.
- Parameters
cm – input confusion matrix array
class_names – class names to use
title – title for resulting plot (default: ‘Confusion Matrix’)
cmap – colormap to use (default: ‘Reds’)
normalize – use classes ratios instead of raw numbers (default: False)
save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)
prestools.graph.plot_dendrogram(df: Union[pandas.core.frame.DataFrame, numpy.ndarray], cut_off: Union[bool, float] = False, title: str = 'Dendrogram', save: Union[bool, str] = False, method: str = 'ward')
Plot a dendrogram from a dataframe.
Create (and optionally save) a dendrogram plot starting from a given dataframe of correlations. It is also possible to add a cut-off line given a distance to use for separating clusters.
- Parameters
df – input dataframe of correlations
cut_off – if not False, a vertical line will be added to better identify clusters (default: False)
title – title for resulting plot (default: ‘Dendrogram’)
save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)
method – method to use to cluster the data (default: ‘ward’)
prestools.graph.plot_heatmap_dendrogram(df: pandas.core.frame.DataFrame, cmap: str = 'RdBu_r', title: str = 'Cluster Heatmap', save: Union[bool, str] = False, method: str = 'ward')
Plot a heatmap with hierarchical clustering of a dataframe.
Create (and optionally save) a heatmap with hierarchical clustering created using Seaborn, starting from a given dataframe of correlations.
- Parameters
df – input dataframe of correlations
cmap – colormap to use (default: ‘RdBu_r’)
title – title for resulting plot (default: ‘Cluster Heatmap’)
save – if False, the plot will not be saved, just shown; otherwise it is possible to specify the path/filename where the file will be saved (default: False)
method – method to use to cluster the data (default: ‘ward’)
prestools.misc¶
prestools.misc.apply_parallel(df: pandas.core.frame.DataFrame, function: Callable, cores: int = 4) → pandas.core.frame.DataFrame
Apply a function to a dataframe in parallel.
Apply the given function to the dataframe, using the given number of cores for computation. The dataframe is split into as many parts as there are cores, and the function is applied to each part separately; finally, the dataframe is reconstructed and returned.
- Parameters
df – input dataframe
function – function to apply
cores – number of cores to use (default: 4)
- Returns
resulting dataframe
- Return type
df
prestools.misc.benchmark(function: Callable) → Callable
Benchmark a given function.
Decorator to run the given function and return the function name and the amount of time spent in executing it.
- Parameters
function – function to benchmark
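A timing decorator of this kind is commonly written as below. This is only a sketch: the name benchmark_sketch is hypothetical, and printing the function name and elapsed time while returning the wrapped function's own result is an assumption about how the report is delivered.

```python
import functools
import time


def benchmark_sketch(function):
    """Decorator that reports how long each call to function takes."""

    @functools.wraps(function)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = function(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{function.__name__}: {elapsed:.6f} s")
        return result

    return wrapper


@benchmark_sketch
def slow_sum(n):
    return sum(range(n))


slow_sum(100_000)
```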
prestools.misc.equal_files(file1: str, file2: str) → bool
Check whether two files are identical.
First check whether the files have the same size, if so read them and check their content for equality.
- Parameters
file1 – first file to compare
file2 – second file to compare
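The two-step check described above (size first, then content) can be sketched as follows. The function name is hypothetical, and reading both files fully into memory is a simplification that may not match the library.

```python
import os


def equal_files_sketch(file1: str, file2: str) -> bool:
    """Compare two files by size first, then by content."""
    # Cheap check: files of different size cannot be identical.
    if os.path.getsize(file1) != os.path.getsize(file2):
        return False
    # Same size: compare the raw bytes.
    with open(file1, "rb") as f1, open(file2, "rb") as f2:
        return f1.read() == f2.read()
```

Checking the size first avoids reading the files at all in the most common case of a mismatch.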
prestools.misc.filter_type(input_list: List[Any], target_type: Type) → List[Any]
Only keep elements of a given type from a list of elements.
Traverse a list and return a new list with only elements of the original list belonging to a given type.
- Parameters
input_list – input list to filter
target_type – desired type to keep
- Returns
filtered list
- Return type
filtered
prestools.misc.flatten(iterable: Iterable, drop_null: bool = False) → List[Any]
Flatten out a nested iterable.
Flatten a nested iterable, even with multiple nesting levels and different data types. It is also possible to drop null values (None) from the resulting list.
- Parameters
iterable – nested iterable to flatten
drop_null – filter out None from the flattened list (default: False)
- Returns
flat list
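Recursive flattening of this kind can be sketched as below. The function name is hypothetical, and treating strings and bytes as single values rather than as iterables of characters is an assumption.

```python
from collections.abc import Iterable


def flatten_sketch(iterable, drop_null=False):
    """Flatten arbitrarily nested iterables into a single list."""
    flat = []
    for item in iterable:
        # Assumption: strings and bytes count as atomic values.
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            flat.extend(flatten_sketch(item, drop_null))
        elif item is None and drop_null:
            continue
        else:
            flat.append(item)
    return flat


print(flatten_sketch([1, [2, [3, None]], "ab", (4,)]))
print(flatten_sketch([1, [2, [3, None]], "ab", (4,)], drop_null=True))
```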
prestools.misc.invert_dict(input_dict: dict, sort_keys: bool = False) → dict
Create a new dictionary swapping keys and values.
Invert a given dictionary, creating a new dictionary in which each key is a value of the original dictionary and its value is the key it was associated with in the original dictionary (e.g. invert_dict({1: ["A", "E"], 2: ["D", "G"]}) = {"A": 1, "E": 1, "D": 2, "G": 2}). The inverted dictionary can also be returned with its keys in alphabetical order; this makes little sense for intrinsically unordered data structures like dictionaries, but it may be useful when printing the results.
- Parameters
input_dict – original dictionary to be inverted
sort_keys – sort the keys in the inverted dictionary in alphabetical order (default: False)
- Returns
inverted dictionary
- Return type
new_dict
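The inversion in the example above can be reproduced with a short sketch. The function name is hypothetical, and accepting both list-like and scalar values is an assumption about the inputs the library supports.

```python
def invert_dict_sketch(input_dict, sort_keys=False):
    """Map every value of input_dict back to the key it came from."""
    new_dict = {}
    for key, values in input_dict.items():
        # Assumption: a non-list value is treated as a single element.
        if not isinstance(values, (list, tuple, set)):
            values = [values]
        for value in values:
            new_dict[value] = key
    if sort_keys:
        new_dict = dict(sorted(new_dict.items()))
    return new_dict


print(invert_dict_sketch({1: ["A", "E"], 2: ["D", "G"]}))
print(invert_dict_sketch({1: ["A", "E"], 2: ["D", "G"]}, sort_keys=True))
```

If the same value appears under several keys, the last key seen wins in this sketch.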
prestools.misc.prime_factors(number: int) → List[int]
Calculate the prime factors of a number.
Calculate the prime factors of a given natural number. Note that 1 is not a prime number, so it will not be included.
- Parameters
number – input natural number
- Returns
list of prime factors
- Return type
factors
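Prime factorisation by trial division is one simple way to obtain such a list. The sketch below is not necessarily the algorithm prestools uses; the function name, the ValueError for numbers below 1, and returning repeated factors with their multiplicity are assumptions.

```python
def prime_factors_sketch(number: int):
    """Return the prime factors of a natural number by trial division."""
    if number < 1:
        raise ValueError("number must be a natural number")
    factors = []
    divisor = 2
    while divisor * divisor <= number:
        # Divide out each prime as many times as it occurs.
        while number % divisor == 0:
            factors.append(divisor)
            number //= divisor
        divisor += 1
    if number > 1:  # whatever remains is itself prime
        factors.append(number)
    return factors


print(prime_factors_sketch(84))
print(prime_factors_sketch(1))
```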
prestools.misc.wordcount(sentence: str, word: Union[bool, str] = False, ignore_case: bool = False) → Union[dict, int]
Count occurrences of words in a sentence.
Return the number of occurrences of each word in the given sentence, in the form of a dictionary; it is also possible to directly return the number of occurrences of a specific word.
- Parameters
sentence – input sentence to count words from
word – target word to count occurrences of
ignore_case – ignore case in the given sentence (default: False)
- Returns
dictionary of word counts
- Return type
word_dict
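The two return modes (full dictionary or a single count) can be sketched with collections.Counter. The function name is hypothetical, and splitting on whitespace without stripping punctuation is an assumption.

```python
from collections import Counter


def wordcount_sketch(sentence, word=False, ignore_case=False):
    """Count words in a sentence, or the occurrences of one target word."""
    if ignore_case:
        sentence = sentence.lower()
    counts = Counter(sentence.split())  # assumption: words are whitespace-separated
    if word:
        target = word.lower() if ignore_case else word
        return counts[target]
    return dict(counts)


print(wordcount_sketch("the cat and the hat"))
print(wordcount_sketch("The cat and the hat", word="the", ignore_case=True))
```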
Command Line Interface¶
prestools bioinf¶
bioinf¶
Bioinformatics utilities
bioinf [OPTIONS] COMMAND [ARGS]...
hamming-distance¶
Hamming distance between two sequences
Calculate the Hamming distance between SEQ_1 and SEQ_2.
bioinf hamming-distance [OPTIONS] SEQ_1 SEQ_2
Options
-i, --ignore_case
Ignore case when comparing sequences (default: False)
Arguments
SEQ_1
Required argument
SEQ_2
Required argument
jukes-cantor-distance¶
Jukes-Cantor distance between two sequences
Return the Jukes-Cantor distance between SEQ_1 and SEQ_2, calculated as distance = -b log(1 - p/b) where b = 3/4 and p = p_distance.
bioinf jukes-cantor-distance [OPTIONS] SEQ_1 SEQ_2
Arguments
SEQ_1
Required argument
SEQ_2
Required argument
kimura-distance¶
Kimura 2-Parameter distance between two sequences
Return the Kimura 2-Parameter distance between SEQ_1 and SEQ_2, calculated as distance = -0.5 log((1 - 2p -q) * sqrt( 1 - 2q )) where p = transition frequency and q = transversion frequency.
bioinf kimura-distance [OPTIONS] SEQ_1 SEQ_2
Arguments
SEQ_1
Required argument
SEQ_2
Required argument
p-distance¶
Pairwise distance between two sequences
Return the uncorrected distance between SEQ_1 and SEQ_2.
bioinf p-distance [OPTIONS] SEQ_1 SEQ_2
Arguments
SEQ_1
Required argument
SEQ_2
Required argument
random-sequence¶
Create a random sequence of the given length
Create a random sequence of the given LENGTH using the specified ALPHABET (nucleotides or amino acids).
bioinf random-sequence [OPTIONS] LENGTH
Options
-a, --alphabet <alphabet>
Character alphabet to use to create the sequence (‘nt’, ‘aa’) (default: ‘nt’)
Options: nt|aa
Arguments
LENGTH
Required argument
reverse-complement¶
Convert a nucleotide sequence into its reverse complement
Convert a nucleotide SEQUENCE into its reverse, complement or reverse complement.
bioinf reverse-complement [OPTIONS] SEQUENCE
Options
-c, --conversion <conversion>
Type of conversion to perform (‘r’|‘reverse’, ‘c’|‘complement’, ‘rc’|‘reverse_complement’) (default: ‘rc’|‘reverse_complement’)
Options: reverse|complement|reverse_complement|r|c|rc
Arguments
SEQUENCE
Required argument
shuffle-sequence¶
Shuffle the given sequence
Randomly shuffle a SEQUENCE, maintaining the same nucleotide composition.
bioinf shuffle-sequence [OPTIONS] SEQUENCE
Arguments
SEQUENCE
Required argument
tajima-nei-distance¶
Tajima-Nei distance between two sequences
Return the Tajima-Nei distance between SEQ_1 and SEQ_2, calculated as distance = -b log(1 - p / b), where:
b = 0.5 * [1 - Sum i from A to T (Gi^2) + p^2 / h]
h = Sum i from A to G (Sum j from C to T (Xij^2 / (2 * Gi * Gj)))
p = p-distance
Xij = frequency of pair (i, j) in SEQ_1 and SEQ_2, with gaps removed
Gi = frequency of base i over SEQ_1 and SEQ_2
bioinf tajima-nei-distance [OPTIONS] SEQ_1 SEQ_2
Arguments
SEQ_1
Required argument
SEQ_2
Required argument
tamura-distance¶
Tamura distance between two sequences
Return the Tamura distance between SEQ_1 and SEQ_2, calculated as distance = -C log(1 - P/C - Q) - 0.5(1 - C)log(1 - 2Q), where:
P = transition frequency
Q = transversion frequency
C = GC1 + GC2 - 2 * GC1 * GC2
GC1 = GC-content of SEQ_1
GC2 = GC-content of SEQ_2
bioinf tamura-distance [OPTIONS] SEQ_1 SEQ_2
Arguments
SEQ_1
Required argument
SEQ_2
Required argument
prestools clustering¶
clustering¶
Data clustering utilities
clustering [OPTIONS] COMMAND [ARGS]...
find-n-clusters-elbow¶
Find the number of clusters using the elbow method
Find the suggested number of clusters for the given dataframe of correlations, using the elbow method.
clustering find-n-clusters-elbow [OPTIONS] DF
Options
-m, --method <method>
Method to be used to cluster the data (‘ward’, ‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’) (default: ‘ward’)
Options: ward|single|complete|average|weighted|centroid|median
Arguments
DF
Required argument
prestools misc¶
misc¶
Miscellaneous utilities
misc [OPTIONS] COMMAND [ARGS]...
equal-files¶
Check whether two files are identical
First check whether FILE1 and FILE2 have the same size, if so read them and check their content for equality.
misc equal-files [OPTIONS] FILE1 FILE2
Arguments
FILE1
Required argument
FILE2
Required argument
prime-factors¶
Calculate the prime factors of a number
Calculate the prime factors of a given natural NUMBER. Note that 1 is not a prime number, so it will not be included.
misc prime-factors [OPTIONS] NUMBER
Arguments
NUMBER
Required argument
wordcount¶
Count occurrences of words in a sentence
Return the number of occurrences of each word in the given SENTENCE, in the form of a dictionary; it is also possible to directly return the number of occurrences of a specific WORD.
misc wordcount [OPTIONS] SENTENCE
Options
-w, --word <word>
Target word to count occurrences of
-i, --ignore_case
Ignore case in the given sentence (default: False)
Arguments
SENTENCE
Required argument