Useful Code

Here is where I put Python code snippets that I use in my Machine Learning research work. I'm using this page to have code easily accessible and to be able to share it with others.

Tip: use the table of contents on the top-right side of the page to avoid endless scrolling. It is also a good idea to use the Copy to clipboard button in the upper-right corner of each code cell to get things done quickly.

Read File

One liner to read any file:

io.open("my_file.txt", mode='r', encoding='utf-8').read()

Details: import io

Write File

One liner to write a string to a file:

io.open("my_file.txt", mode='w', encoding='utf-8').write("Your text!")

Details: import io
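The two one-liners above never close the file explicitly. A minimal sketch of the same read and write using with blocks, which close the file as soon as the block ends. The helper names write_text and read_text and the temporary demo file are assumptions made for illustration:

```python
import tempfile
from pathlib import Path


def write_text(path, text, encoding="utf-8"):
    """Write a string to a file and close it when done (hypothetical helper)."""
    with open(path, mode="w", encoding=encoding) as file_object:
        file_object.write(text)


def read_text(path, encoding="utf-8"):
    """Return the whole file as a string and close it when done (hypothetical helper)."""
    with open(path, mode="r", encoding=encoding) as file_object:
        return file_object.read()


# Demo on a throwaway file so the sketch can run anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "my_file.txt"
    write_text(demo_path, "Your text!")
    demo_text = read_text(demo_path)

print(demo_text)
```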

Debug

Start debugging after this line:

import pdb; pdb.set_trace()

Details: use dir() to see all current variables, locals() to see variables and their values and globals() to see all global variables with values.
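To make the difference between these built-ins concrete, here is a small sketch that can be run outside the debugger. The function inspect_scope and the variable names are made up for the example:

```python
GLOBAL_SETTING = "defined at module level"  # hypothetical global variable


def inspect_scope(batch_size=32):
    learning_rate = 0.001
    # dir() lists only the local names; locals() maps each name to its value.
    return sorted(dir()), dict(locals())


local_names, local_values = inspect_scope()

print(local_names)
print(local_values)
# globals() is a dictionary too, so membership tests work on it.
print("GLOBAL_SETTING" in globals())
```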

Pip Install GitHub

Install library directly from GitHub using pip:

pip install git+github_url

Details: append @version_number (a tag, branch, or commit) to the URL to install a specific version.

Parse Argument

Parse arguments given when running a .py file.

import argparse

parser = argparse.ArgumentParser(description='Description')
parser.add_argument('--argument', help='Help me.', type=str)
# parse arguments
args = parser.parse_args()

Details: import argparse and use python script.py --argument something when running script.
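In a notebook there is no command line, so parse_args can be given the argument list directly. A small sketch, assuming a hypothetical extra --batch_size flag that is added only to show a typed default:

```python
import argparse

parser = argparse.ArgumentParser(description="Description")
parser.add_argument("--argument", help="Help me.", type=str)
# Hypothetical extra flag, added only to show a typed default value.
parser.add_argument("--batch_size", help="Examples per batch.", type=int, default=32)

# Pass the arguments as a list instead of reading them from sys.argv.
args = parser.parse_args(["--argument", "something"])

print(args.argument, args.batch_size)
```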

Create Arguments from Dictionary

Create argparse arguments from a dictionary.

import argparse

PARAMETERS = {
    "lm": "bert",
    "bert_model_name": "bert-large-cased",
    "bert_model_dir": "pre-trained_language_models/bert/cased_L-24_H-1024_A-16",
    "bert_vocab_name": "vocab.txt",
    "batch_size": 32,
}
args = argparse.Namespace(**PARAMETERS)

Details: Code adapted from GitHub LAMA.
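A short usage sketch with a trimmed copy of the dictionary: the Namespace exposes each key as an attribute, and vars() turns it back into a dictionary. The shortened PARAMETERS here is only for illustration:

```python
import argparse

# Trimmed copy of the parameters, for illustration only.
PARAMETERS = {"lm": "bert", "batch_size": 32}
args = argparse.Namespace(**PARAMETERS)

# Keys become attributes, just like arguments parsed from the command line.
print(args.lm, args.batch_size)

# vars() gives the attributes back as a dictionary.
as_dictionary = vars(args)
print(as_dictionary)
```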

Doctest

How to run a simple unit test using a function's documentation. Useful when you need to run unit tests inside a notebook:

# function to test
def add(a, b):
  '''
  >>> add(2, 2)
  4
  '''
  return a + b

# run doctest
import doctest
doctest.testmod(verbose=True)

Details: ml_things
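testmod checks every docstring in the current module. To check a single function and get the pass and fail counts back as values, something like the following sketch may work. It relies on the standard doctest.DocTestParser and doctest.DocTestRunner classes, and the multiply function is a made-up example:

```python
import doctest


def multiply(a, b):
    """Return the product of a and b.

    >>> multiply(2, 3)
    6
    >>> multiply("ab", 2)
    'abab'
    """
    return a * b


# Build a DocTest from the function's docstring and run it quietly.
test = doctest.DocTestParser().get_doctest(
    multiply.__doc__, {"multiply": multiply}, "multiply", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

# results.failed counts failing examples, results.attempted counts all examples run.
print(results.failed, results.attempted)
```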

Fix Text

I use this package every time I read text data from a source I don't trust. Since text data is always messy, I always use it. It is great at fixing bad Unicode.

from ftfy import fix_text

fix_text(text="Text to be fixed")

Details: Install it with pip install ftfy.

Current Date

How to get the current date in Python. I use this when I need to name log files:

from datetime import date

today = date.today()

# dd/mm/YYYY in string format
today.strftime("%d/%m/%Y")

Details: More details here

Current Time

Get current time in Python:

from datetime import datetime

# datetime object containing current date and time
now = datetime.now()

# dd/mm/YYYY HH:MM:SS
now.strftime("%d/%m/%Y %H:%M:%S")

Details: More details here
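Since the date and time snippets are mostly used for naming log files, here is a minimal sketch of that use. The function timestamped_log_name and the name pattern are assumptions; the format avoids slashes and colons, which are not safe in file names on every system:

```python
from datetime import datetime


def timestamped_log_name(prefix, moment=None):
    """Build a log file name such as prefix_YYYY-mm-dd_HH-MM-SS.log (hypothetical helper)."""
    # Fall back to the current date and time when no moment is given.
    moment = moment or datetime.now()
    return f"{prefix}_{moment.strftime('%Y-%m-%d_%H-%M-%S')}.log"


# A fixed moment keeps the example reproducible.
example_name = timestamped_log_name("train", datetime(2021, 3, 4, 15, 6, 7))
print(example_name)  # train_2021-03-04_15-06-07.log

# In real use, omit the moment to get the current time.
print(timestamped_log_name("train"))
```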

Remove Punctuation

One of the fastest ways to remove punctuation in Python 3:

import string

table = str.maketrans(dict.fromkeys(string.punctuation))
"string. With. Punctuation?".translate(table)

Details: Import string. Code adapted from StackOverflow Remove punctuation from Unicode formatted strings.
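One thing to keep in mind is that string.punctuation only contains ASCII punctuation. The sketch below shows the effect on a string with a non-ASCII mark, plus one possible variant that builds the table from Unicode categories instead. The variant is an assumption about what you might want, not part of the snippet above:

```python
import string
import sys
import unicodedata

# ASCII-only table, as in the snippet above.
ascii_table = str.maketrans(dict.fromkeys(string.punctuation))

# Variant: drop every code point whose Unicode category starts with "P" (punctuation).
unicode_table = dict.fromkeys(
    code_point
    for code_point in range(sys.maxunicode + 1)
    if unicodedata.category(chr(code_point)).startswith("P")
)

ascii_result = "string. With. Punctuation?".translate(ascii_table)
partly_cleaned = "¿Qué tal?".translate(ascii_table)    # the inverted question mark stays
fully_cleaned = "¿Qué tal?".translate(unicode_table)   # both question marks are removed

print(ascii_result)
print(partly_cleaned)
print(fully_cleaned)
```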

Class Instances from Dictionary

Create a class whose attributes come from a dictionary. Very handy when working in notebooks and you need to pass arguments as attributes of an object.

# Create dictionary of arguments.
my_args = dict(argument_one=23, argument_two=False)
# Convert dictionary entries to class attributes.
my_args = type('allMyArguments', (object,), my_args)

Details: Code adapted from StackOverflow Creating class instance properties from a dictionary?
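A possible alternative from the standard library is types.SimpleNamespace, which gives an actual instance with the same attribute access. This is a sketch of a different technique, not the StackOverflow answer above:

```python
from types import SimpleNamespace

# Create dictionary of arguments.
my_args = dict(argument_one=23, argument_two=False)

# Each key becomes an attribute of the instance.
args = SimpleNamespace(**my_args)
print(args.argument_one, args.argument_two)

# Attributes can be changed without touching the original dictionary.
args.argument_two = True
print(my_args["argument_two"], args.argument_two)
```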

List of Lists into Flat List

Given a list of lists, convert it to a single flat list. It is a quick way to do it while preserving each element's type.

import functools
import operator

l = [[1,2,3],[4,5,6], [7], [8,9], ['this', 'is']]

functools.reduce(operator.concat, l)

Details: Import operator, functools. Code adapted from StackOverflow How to make a flat list out of list of lists?

Pickle and Unpickle

Save Python objects to binary files using pickle and load them back.

import pickle

a = {'hello': 'world'}

with open('filename.pickle', 'wb') as handle:
    pickle.dump(a, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('filename.pickle', 'rb') as handle:
    b = pickle.load(handle)

Details: Import pickle. Code adapted from StackOverflow How can I use pickle to save a dict?
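When no file is needed, the same round trip can be done in memory with pickle.dumps and pickle.loads. A minimal sketch with a made-up dictionary; as the pickle documentation warns, only unpickle data that comes from a source you trust:

```python
import pickle

a = {"hello": "world", "values": [1, 2, 3]}

# Serialize to a bytes object instead of a file.
payload = pickle.dumps(a, protocol=pickle.HIGHEST_PROTOCOL)

# Rebuild an equal, independent copy from the bytes.
b = pickle.loads(payload)

print(type(payload).__name__, b == a, b is a)
```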

Notebook Input Variables

How to ask the user for a value to store in a variable. For a password, use getpass so the typed value is not echoed on screen.

from getpass import getpass
# Populate variables from user inputs.
user = input('User name: ')
password = getpass('Password: ')

Details: Code adapted from StackOverflow Methods for using Git with Google Colab

Notebook Clone private Repository GitHub

How to clone a private repo. You will be asked for your username and password. This snippet can be run multiple times because it first checks whether the repo was already cloned.

import os
from getpass import getpass
# Repository name.
repo = 'gmihaila/ml_things'

# Remove .git extension if present.
repo = repo[:-4] if repo.endswith('.git') else repo
# Local folder the repo is cloned into.
repo_dir = os.path.join('/content', os.path.basename(repo))
# Check if repo wasn't already cloned
if not os.path.isdir(repo_dir):
  # Use GitHub username.
  u = input('GitHub username: ')
  # Ask user for GitHub password (GitHub now expects a personal access token here).
  p = getpass('GitHub password: ')
  # Clone repo.
  !git clone https://$u:$p@github.com/$repo $repo_dir
  # Remove password variable.
  p = ''
else:
  # Make sure repo is up to date - pull.
  !git -C $repo_dir pull

Details: Code adapted from StackOverflow Methods for using Git with Google Colab

Import Module Given Path

How to import a module from a local path. Make it act as an installed library.

import sys
# Append module path.
sys.path.append('/path/to/module')

Details: After that we can use import module.stuff. Code adapted from StackOverflow Adding a path to sys.path (over using imp).
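If you would rather not modify sys.path, a single file can also be loaded with importlib. This is a sketch of a different technique from the one above; the helper import_module_from_path and the throwaway stuff.py file are assumptions for the demo:

```python
import importlib.util
import tempfile
from pathlib import Path


def import_module_from_path(module_name, file_path):
    """Load one .py file as a module without touching sys.path (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location(module_name, str(file_path))
    module = importlib.util.module_from_spec(spec)
    # Run the file's code inside the new module object.
    spec.loader.exec_module(module)
    return module


# Demo with a throwaway module file.
with tempfile.TemporaryDirectory() as tmp_dir:
    module_path = Path(tmp_dir) / "stuff.py"
    module_path.write_text("ANSWER = 42\n", encoding="utf-8")
    stuff = import_module_from_path("stuff", module_path)
    answer = stuff.ANSWER

print(answer)
```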

PyTorch

Code snippets related to PyTorch:

Dataset

Code sample on how to create a PyTorch Dataset. The __len__(self) function needs to return the number of examples in your dataset and __getitem__(self, item) will use the index item to select an example from your dataset:

from torch.utils.data import Dataset, DataLoader

class PyTorchDataset(Dataset):
  """PyTorch Dataset.
  """

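  def __init__(self, examples):
    # `examples` is a placeholder name: any indexable collection of examples.
    self.examples = examples

  def __len__(self):
    # Number of examples in the dataset.
    return len(self.examples)

  def __getitem__(self, item):
    # Select the example at index `item`.
    return self.examples[item]

# create pytorch dataset (toy data, for illustration only)
pytorch_dataset = PyTorchDataset(examples=list(range(100)))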
# move pytorch dataset into dataloader
pytorch_dataloader = DataLoader(pytorch_dataset, batch_size=32, shuffle=True)

Details: Find more details here

PyTorch Device

How to set up the device in PyTorch to detect if a GPU is available. If there is no GPU available it will default to CPU.

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Details: Adapted from Stack Overflow How to check if pytorch is using the GPU?.

Get All File Paths

How to get all file paths from a folder with multiple subfolders.

from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))

Details: The "*.[tT][xX][tT]" pattern matches the .txt extension in any letter case (for example .txt or .TXT). Code adapted from StackOverflow Recursive sub folder search and return files in a list python.
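To see what the pattern picks up, here is a small self-contained sketch that builds a throwaway folder tree and searches it. The file and folder names are invented for the demo:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    # Build a small tree: two text files at different depths and one other file.
    (root / "sub" / "deeper").mkdir(parents=True)
    (root / "a.txt").write_text("a")
    (root / "sub" / "b.TXT").write_text("b")
    (root / "sub" / "deeper" / "c.md").write_text("c")

    # rglob searches the folder and all of its subfolders.
    matches = sorted(path.name for path in root.rglob("*.[tT][xX][tT]"))

print(matches)
```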

Logging

Logging in both log file and stdout.

import logging
import os
import sys

# Setup logging to show in stdout and log file.
file_handler = logging.FileHandler('{}.log'.format(os.path.splitext(__file__)[0]))
stdout_handler = logging.StreamHandler(sys.stdout)
handlers = [file_handler, stdout_handler]
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=handlers,
    level=logging.DEBUG,
)

logging.info("This is a test")

Details: This setup logs everything to both a log file and stdout at the same time. Code adapted from StackOverflow Making Python loggers output all messages to stdout in addition to log file.
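logging.basicConfig does nothing when the root logger already has handlers, which can be the case in some notebook environments. A possible workaround is to attach both handlers to a named logger instead. The sketch below assumes a hypothetical build_logger helper and uses a temporary file and an in-memory stream in place of a real log file and stdout:

```python
import io
import logging
import tempfile
from pathlib import Path


def build_logger(name, log_path, stream):
    """Return a named logger that writes to a file and a stream (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # Keep messages from also reaching the root logger's handlers.
    logger.propagate = False
    formatter = logging.Formatter(
        "%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
    )
    for handler in (logging.FileHandler(log_path), logging.StreamHandler(stream)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = Path(tmp_dir) / "example.log"
    stream = io.StringIO()  # stands in for sys.stdout in this demo
    logger = build_logger("example", log_path, stream)
    logger.info("This is a test")
    # Close the handlers so the file is flushed and released.
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)
    file_text = log_path.read_text()
    stream_text = stream.getvalue()

print(file_text, end="")
```

In a script, pass sys.stdout as the stream and a real path as log_path.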

Compressed JSON Lines text file

From the JSON Lines documentation: JSON Lines text format, also called newline-delimited JSON. JSON Lines is a convenient format for storing structured data that may be processed one record at a time.

As a bonus I will compress the JSON Lines text file to save storage. Since this file can be read line by line, it does not need to be loaded entirely into memory!

import gzip
import json

dictionary_file = "dictionary_list.jsonl.gz"
dictionary_list = [{"user": "admin",
                    "age": 20},
                   {"user": "me",
                    "age": 25,
                    "notes": "this is a test"}]

with gzip.open(filename=dictionary_file, mode="wb") as file_object:
  for dictionary_entry in dictionary_list:
    dictionary_entry_json_string = f"{json.dumps(dictionary_entry)}\n"
    dictionary_entry_json_byte = dictionary_entry_json_string.encode('utf-8')
    file_object.write(dictionary_entry_json_byte)

with gzip.open(filename=dictionary_file, mode="rb") as file_object:
  for entry in file_object:
    print(f"FILE LINE: {json.loads(entry.decode('utf-8'))}")

Details: Code adapted from StackOverflow Python 3, read/write compressed json objects from/to gzip file.
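gzip.open also supports text modes ("wt" and "rt") with an encoding, which removes the manual encode and decode steps. A minimal sketch of that variant, writing to a temporary folder so it can run anywhere:

```python
import gzip
import json
import tempfile
from pathlib import Path

dictionary_list = [{"user": "admin", "age": 20},
                   {"user": "me", "age": 25, "notes": "this is a test"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    dictionary_file = Path(tmp_dir) / "dictionary_list.jsonl.gz"

    # Text mode: write one JSON document per line.
    with gzip.open(dictionary_file, mode="wt", encoding="utf-8") as file_object:
        for dictionary_entry in dictionary_list:
            file_object.write(json.dumps(dictionary_entry) + "\n")

    # Text mode: read the file back one line (one record) at a time.
    with gzip.open(dictionary_file, mode="rt", encoding="utf-8") as file_object:
        loaded_entries = [json.loads(line) for line in file_object]

print(loaded_entries)
```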

List of lists to list (flatten list)

Flatten a list of lists into a single list of values. Note: this only works for lists nested one level deep; it does not flatten deeper nesting.

import itertools
list2d = [[1,2,3], [4,5,6], [7], [8,9]]
list(itertools.chain(*list2d))

Details: Code adapted from StackOverflow How do I make a flat list out of a list of lists?.
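For deeper nesting, a small recursive helper is one option. This is a sketch under the assumption that only lists and tuples count as containers, so strings are kept whole; the name flatten is hypothetical:

```python
def flatten(nested):
    """Flatten lists and tuples nested to any depth into one flat list."""
    flat = []
    for element in nested:
        if isinstance(element, (list, tuple)):
            # Recurse into nested containers.
            flat.extend(flatten(element))
        else:
            # Keep every other value, including strings, as a single element.
            flat.append(element)
    return flat


deep_list = [1, [2, [3, [4, 5]], "this"], ("is", [6])]
flat_list = flatten(deep_list)
print(flat_list)  # [1, 2, 3, 4, 5, 'this', 'is', 6]
```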