Useful Code
Here is where I put Python code snippets that I use in my Machine Learning research work. I'm using this page to keep code easily accessible and to share it with others.
Tip: use the Table of contents on the top-right side of the page to avoid endless scrolling, and use the Copy to clipboard button in the upper-right corner of each code cell to get things done quickly.
Read File
One-liner to read any file:
import io
io.open("my_file.txt", mode='r', encoding='utf-8').read()
Write File
One liner to write a string to a file:
import io
io.open("my_file.txt", mode='w', encoding='utf-8').write("Your text!")
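The one-liners above leave closing the file to the garbage collector. Below is a minimal sketch of the same read and write using `with` blocks, which close the file as soon as the block ends. The helper names `write_text` and `read_text` are hypothetical, not part of the original snippets.

```python
import io


def write_text(path, text):
    """Write a string to a UTF-8 file and close the file afterwards."""
    with io.open(path, mode="w", encoding="utf-8") as handle:
        handle.write(text)


def read_text(path):
    """Read a whole UTF-8 file into a string and close the file afterwards."""
    with io.open(path, mode="r", encoding="utf-8") as handle:
        return handle.read()
```

Typical use would be `write_text("my_file.txt", "Your text!")` followed by `read_text("my_file.txt")`.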
Debug
Start debugging after this line:
import pdb; pdb.set_trace()
Inside the debugger, use dir() to see all current variable names, locals() to see local variables and their values, and globals() to see all global variables with their values.
Pip Install GitHub
Install library directly from GitHub using pip:
pip install git+github_url
Append @version_number to the end of the URL to install a specific version.
Parse Argument
Parse arguments given when running a .py file.
import argparse
# Build the parser and declare the expected arguments.
parser = argparse.ArgumentParser(description='Description')
parser.add_argument('--argument', help='Help me.', type=str)
# parse arguments
args = parser.parse_args()
Run the script with python script.py --argument something.
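When trying the parser inside a notebook there is no command line to read from. As a sketch, `parse_args` also accepts an explicit list of strings, which stands in for the command line here. The `build_parser` helper is a hypothetical wrapper around the same two calls used above.

```python
import argparse


def build_parser():
    """Create the same parser as in the snippet above."""
    parser = argparse.ArgumentParser(description="Description")
    parser.add_argument("--argument", help="Help me.", type=str)
    return parser


# The list plays the role of: python script.py --argument something
args = build_parser().parse_args(["--argument", "something"])
print(args.argument)
```

An argument that is not supplied falls back to its default, which is `None` unless `default=` is set.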
Create Arguments from Dictionary
Create argparse arguments from a dictionary.
import argparse
PARAMETERS = {
    "lm": "bert",
    "bert_model_name": "bert-large-cased",
    "bert_model_dir": "pre-trained_language_models/bert/cased_L-24_H-1024_A-16",
    "bert_vocab_name": "vocab.txt",
    "batch_size": 32
}
# Each dictionary key becomes an attribute, e.g. args.batch_size.
args = argparse.Namespace(**PARAMETERS)
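A short sketch of how the resulting object behaves, using a shortened two-key version of the dictionary for illustration: keys are read as attributes, and `vars()` recovers the dictionary.

```python
import argparse

# Shortened parameter dictionary, for illustration only.
PARAMETERS = {"lm": "bert", "batch_size": 32}
args = argparse.Namespace(**PARAMETERS)

# Keys are now attributes.
print(args.lm, args.batch_size)

# vars() gives the dictionary view back, so the conversion is reversible.
as_dict = vars(args)
print(as_dict)
```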
Doctest
How to run a simple unit test using a function's docstring. Useful when you need to run unit tests inside a notebook:
# function to test
def add(a, b):
    '''
    >>> add(2, 2)
    4
    '''
    return a + b
# run doctest
import doctest
doctest.testmod(verbose=True)
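`doctest.testmod` checks every docstring in the current module. As a sketch, the doctests of a single function can also be run on their own with `DocTestFinder` and `DocTestRunner`, which returns the failure and attempt counts. The `run_doctests` helper is a hypothetical name.

```python
import doctest


def add(a, b):
    """
    >>> add(2, 2)
    4
    >>> add(-1, 1)
    0
    """
    return a + b


def run_doctests(func):
    """Run only the doctests found in func's docstring and return the counts."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # globs makes the function visible by name inside its own examples.
    for test in finder.find(func, name=func.__name__, globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


results = run_doctests(add)
print(results.failed, results.attempted)
```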
Fix Text
I use this package every time I read text data from a source I don't trust. Since text data is usually messy, that means almost always. It is great at fixing bad Unicode.
from ftfy import fix_text
fix_text(text="Text to be fixed")
Make sure ftfy is installed with pip install ftfy.
Current Date
How to get the current date in Python. I use this when I need to name log files:
from datetime import date
today = date.today()
# dd/mm/YYYY in string format
today.strftime("%d/%m/%Y")
Current Time
Get the current time in Python:
from datetime import datetime
# datetime object containing current date and time
now = datetime.now()
# dd/mm/YYYY H:M:S
now.strftime("%d/%m/%Y %H:%M:%S")
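The formats above contain slashes and colons, which cannot be used directly in a file name on most systems. Here is a minimal sketch of a file-safe timestamp for naming log files. The `log_file_name` helper, its `prefix` default, and the chosen format are assumptions, not part of the original snippets.

```python
from datetime import datetime


def log_file_name(prefix="run", now=None):
    """Build a name such as run_2024-01-31_14-05-09.log."""
    now = now or datetime.now()
    # Year-first order sorts chronologically; no slashes or colons in the name.
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return f"{prefix}_{stamp}.log"


print(log_file_name())
```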
Remove Punctuation
A fast way to remove punctuation in Python 3:
import string
# Map every punctuation character to None so translate() deletes it.
table = str.maketrans(dict.fromkeys(string.punctuation))
"string. With. Punctuation?".translate(table)
Code adapted from StackOverflow Remove punctuation from Unicode formatted strings.
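A sketch of the same idea wrapped in a reusable function, so the translation table is built once. The name `remove_punctuation` is hypothetical. Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes are left in place.

```python
import string

# Built once: every ASCII punctuation character maps to None (deleted).
PUNCTUATION_TABLE = str.maketrans(dict.fromkeys(string.punctuation))


def remove_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(PUNCTUATION_TABLE)


cleaned = remove_punctuation("string. With. Punctuation?")
print(cleaned)  # string With Punctuation
```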
Class Instances from Dictionary
Create a class whose attributes come from a dictionary. Very handy when working in notebooks and you need to pass arguments as object attributes.
# Create dictionary of arguments.
my_args = dict(argument_one=23, argument_two=False)
# Convert the dictionary into a class with one attribute per key.
my_args = type('allMyArguments', (object,), my_args)
Details: Code adapted from StackOverflow Creating class instance properties from a dictionary?
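The `type(...)` call above builds a new class, so the values end up as class attributes. If an actual instance is preferred, `types.SimpleNamespace` from the standard library is one alternative. This is a swapped-in technique, not the approach from the StackOverflow answer, and the variable names are for illustration only.

```python
from types import SimpleNamespace

my_args = dict(argument_one=23, argument_two=False)

# Original approach: a new class whose class attributes are the dictionary items.
args_class = type("allMyArguments", (object,), my_args)

# Alternative: an instance carrying the same values as instance attributes.
args_instance = SimpleNamespace(**my_args)

print(args_class.argument_one, args_instance.argument_two)
```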
List of Lists into Flat List
Given a list of lists, convert it to a single flat list. This is a fast way that preserves each element's type.
import functools
import operator
l = [[1,2,3],[4,5,6], [7], [8,9], ['this', 'is']]
functools.reduce(operator.concat, l)
Code adapted from StackOverflow How to make a flat list out of list of lists?
Pickle and Unpickle
Save Python objects to a binary file with pickle, and load them back from it.
import pickle
a = {'hello': 'world'}
# Serialize the object to disk.
with open('filename.pickle', 'wb') as handle:
    pickle.dump(a, handle, protocol=pickle.HIGHEST_PROTOCOL)
# Load the object back from disk.
with open('filename.pickle', 'rb') as handle:
    b = pickle.load(handle)
Code adapted from StackOverflow How can I use pickle to save a dict?
Notebook Input Variables
How to ask the user for a value to store in a variable. For a password, use getpass so the typed text is not shown on screen.
from getpass import getpass
# Populate variables from user inputs.
user = input('User name: ')
password = getpass('Password: ')
Notebook Clone private Repository GitHub
How to clone a private repo. It asks for your GitHub username and password (GitHub now expects a personal access token in place of the account password for Git over HTTPS). This snippet can be run multiple times because it first checks whether the repo was already cloned.
import os
from getpass import getpass
# Repository name.
repo = 'gmihaila/ml_things'
# Remove .git extension if present.
repo = repo[:-4] if repo.endswith('.git') else repo
# Local folder where the repo is cloned.
repo_dir = os.path.join('/content', os.path.basename(repo))
# Check if repo wasn't already cloned
if not os.path.isdir(repo_dir):
    # Use GitHub username.
    u = input('GitHub username: ')
    # Ask user for GitHub password (or personal access token).
    p = getpass('GitHub password: ')
    # Clone repo.
    !git clone https://$u:$p@github.com/$repo $repo_dir
    # Remove password variable.
    p = ''
else:
    # Make sure repo is up to date - pull.
    !git -C $repo_dir pull
Details: Code adapted from StackOverflow Methods for using Git with Google Colab
Import Module Given Path
How to import a module from a local path so that it behaves like an installed library.
import sys
# Append module path.
sys.path.append('/path/to/module')
import module.stuff
Code adapted from StackOverflow Adding a path to sys.path (over using imp).
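Appending to `sys.path` affects every later import in the session. As a sketch of a narrower alternative, a single file can be loaded by its path with `importlib.util` without touching `sys.path`. The helper name `import_from_path` is hypothetical.

```python
import importlib.util


def import_from_path(module_name, file_path):
    """Load the .py file at file_path and return it as a module object."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    # Execute the file's code inside the new module's namespace.
    spec.loader.exec_module(module)
    return module
```

For example, `stuff = import_from_path("stuff", "/path/to/module/stuff.py")`, where the file name `stuff.py` is an assumption made for illustration.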
PyTorch
Code snippets related to PyTorch:
Dataset
Code sample on how to create a PyTorch Dataset. The __len__(self) function needs to return the number of examples in your dataset, and __getitem__(self, item) uses the index item to select an example from your dataset:
from torch.utils.data import Dataset, DataLoader
# Skeleton: fill in the three methods before use.
class PyTorchDataset(Dataset):
    """PyTorch Dataset.
    """
    def __init__(self):
        # Load or store your examples here.
        return
    def __len__(self):
        # Return the number of examples.
        return
    def __getitem__(self, item):
        # Return the example at index `item`.
        return
# create pytorch dataset
pytorch_dataset = PyTorchDataset()
# move pytorch dataset into dataloader
pytorch_dataloader = DataLoader(pytorch_dataset, batch_size=32, shuffle=True)
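To make the two methods concrete, here is a minimal sketch of a dataset over in-memory texts and labels. It is written in plain Python so that it runs without PyTorch installed; in real use the class would subclass `torch.utils.data.Dataset` as in the skeleton above. The class name `TextLabelDataset` and its fields are hypothetical.

```python
class TextLabelDataset:
    """Pairs of (text, label) held in memory."""

    def __init__(self, texts, labels):
        if len(texts) != len(labels):
            raise ValueError("texts and labels must have the same length")
        self.texts = texts
        self.labels = labels

    def __len__(self):
        # Number of examples in the dataset.
        return len(self.texts)

    def __getitem__(self, item):
        # Select one example by its index.
        return {"text": self.texts[item], "label": self.labels[item]}


dataset = TextLabelDataset(["good movie", "bad movie"], [1, 0])
print(len(dataset), dataset[0])
```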
PyTorch Device
How to set up the device in PyTorch to detect if a GPU is available. If there is no GPU available it will default to CPU.
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Get All File Paths
How to get all file paths from a folder with multiple subfolders.
from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))
The pattern "*.[tT][xX][tT]" matches the .txt extension in any letter case (.txt or .TXT). Code adapted from StackOverflow Recursive sub folder search and return files in a list python.
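Writing one bracket pair per letter gets tedious for longer extensions. A sketch of an alternative that walks everything with `rglob("*")` and compares the lowercased suffix instead. The helper name `find_files` and its parameters are hypothetical.

```python
from pathlib import Path


def find_files(root, extension=".txt"):
    """Return all files under root whose suffix matches extension, ignoring case."""
    wanted = extension.lower()
    return sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == wanted
    )
```

Calling `find_files(".")` would collect the same files as the pattern above.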
Logging
Log to both a log file and stdout.
import logging
import os
import sys
# Setup logging to show in stdout and log file.
file_handler = logging.FileHandler('{}.log'.format(os.path.splitext(__file__)[0]))
stdout_handler = logging.StreamHandler(sys.stdout)
handlers = [file_handler, stdout_handler]
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
handlers=handlers,
level=logging.DEBUG,
)
logging.info("This is a test")
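Two caveats apply to the snippet above in notebooks: `__file__` is not defined there, and `logging.basicConfig` does nothing if the root logger already has handlers. Here is a sketch that attaches the same two handlers to a named logger instead. The function name `build_logger` and its parameters are assumptions.

```python
import logging
import sys


def build_logger(name, log_path):
    """Return a logger that writes to both log_path and stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # Keep records from also reaching the root logger's handlers.
    logger.propagate = False
    formatter = logging.Formatter(
        fmt="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
    )
    # Attach handlers only once, so re-running a cell does not duplicate output.
    if not logger.handlers:
        for handler in (logging.FileHandler(log_path), logging.StreamHandler(sys.stdout)):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger
```

Usage would look like `build_logger("experiment", "experiment.log").info("This is a test")`, with both names chosen freely.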
Compressed JSON Lines text file
From the JSON Lines documentation: JSON Lines text format, also called newline-delimited JSON. JSON Lines is a convenient format for storing structured data that may be processed one record at a time.
As a bonus I will compress the JSON Lines text file to save storage. Since this file can be read line by line, it does not need to be loaded entirely into memory!
import gzip
import json
dictionary_file = "dictionary_list.jsonl.gz"
dictionary_list = [{"user": "admin",
                    "age": 20},
                   {"user": "me",
                    "age": 25,
                    "notes": "this is a test"}]
# Write one JSON object per line, encoded as UTF-8 bytes.
with gzip.open(filename=dictionary_file, mode="wb") as file_object:
    for dictionary_entry in dictionary_list:
        dictionary_entry_json_string = f"{json.dumps(dictionary_entry)}\n"
        dictionary_entry_json_byte = dictionary_entry_json_string.encode('utf-8')
        file_object.write(dictionary_entry_json_byte)
# Read the file back one line at a time.
with gzip.open(filename=dictionary_file, mode="rb") as file_object:
    for entry in file_object:
        print(f"FILE LINE: {json.loads(entry.decode('utf-8'))}")
Details: Code adapted from StackOverflow Python 3, read/write compressed json objects from/to gzip file.
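The manual encode and decode steps can be avoided by opening the gzip file in text mode (`"wt"` and `"rt"`). A minimal sketch, with hypothetical helper names `write_jsonl_gz` and `read_jsonl_gz`; the reader is a generator, so records are still processed one line at a time.

```python
import gzip
import json


def write_jsonl_gz(path, records):
    """Write each record as one JSON line into a gzip-compressed text file."""
    with gzip.open(path, mode="wt", encoding="utf-8") as file_object:
        for record in records:
            file_object.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Yield one decoded record per line without loading the whole file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as file_object:
        for line in file_object:
            yield json.loads(line)
```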
List of lists to list (flatten list)
Flatten a list of lists into a single list of values. Note: this only works for lists nested one level deep; it does not flatten deeper nesting.
import itertools
list2d = [[1,2,3], [4,5,6], [7], [8,9]]
list(itertools.chain(*list2d))
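For the deeper nesting mentioned in the note, a recursive generator is one option. This is a sketch under the assumption that only lists and tuples count as containers, so strings stay intact. The name `flatten` is hypothetical.

```python
def flatten(nested):
    """Yield the leaf values of arbitrarily nested lists and tuples, in order."""
    for element in nested:
        if isinstance(element, (list, tuple)):
            # Recurse into the inner container.
            yield from flatten(element)
        else:
            yield element


flat = list(flatten([[1, 2, [3]], [4, [5, [6]]], 7, "text"]))
print(flat)  # [1, 2, 3, 4, 5, 6, 7, 'text']
```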