Notes on Data Science Tools


Interactive applications

Shiny App

Mastering Shiny by Hadley Wickham

A Shiny app that uses ‘reactive’ data

Create date range input

UBC STAT545: All the shiny things

Django

Documentation: Getting started

DJ4E: Django for Everybody

Udemy: Python and Django Full Stack Web Developer Bootcamp

SQL

Guru99: Top 50 SQL Interview Questions & Answers

CHEATSHEET: SQL & MYSQL

Review: SQL Problems

SQL databases and R (Data Carpentry)

Functions

  • Join

  • Subquery

  • Window functions
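The three query features listed above can be tried without a database server. The following is a minimal sketch using Python's built-in `sqlite3` module; the `dept` and `emp` tables and their rows are made up for illustration, and the window-function query assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT,
                      dept_id INTEGER, salary REAL);
    INSERT INTO dept VALUES (1, 'Data'), (2, 'Web');
    INSERT INTO emp VALUES
        (1, 'Ann', 1, 120), (2, 'Bob', 1, 100), (3, 'Cy', 2, 90);
""")

# Join: attach each employee's department name.
joined = conn.execute("""
    SELECT e.name, d.name
    FROM emp AS e
    JOIN dept AS d ON d.id = e.dept_id
    ORDER BY e.id
""").fetchall()

# Subquery: employees paid above the overall average salary.
above_avg = conn.execute("""
    SELECT name FROM emp
    WHERE salary > (SELECT AVG(salary) FROM emp)
    ORDER BY name
""").fetchall()

# Window function: rank salaries within each department.
ranked = conn.execute("""
    SELECT name,
           RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rnk
    FROM emp
    ORDER BY dept_id, rnk
""").fetchall()

print(joined)
print(above_avg)
print(ranked)
```

The join combines rows across tables, the subquery feeds one query's result into another, and the window function computes a per-group rank without collapsing the rows the way `GROUP BY` would.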

Search Text with Regular Expressions

MySQL: Regular Expressions

NoSQL databases

  1. Elasticsearch

  2. MongoDB

  3. CouchDB

  4. Cassandra

  5. HBase

Questions

  • What are the two types of SQL?

  • What’s a relational database?

  • What’s a table?

  • What’s the difference between structured data and unstructured data?

  • How do you create a table?
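For the last question, here is one way to create a table and check it, sketched with Python's built-in `sqlite3` module. The `tools` table, its columns, and the sample rows are hypothetical; other database engines use slightly different column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Define the table: column names, types, and constraints.
conn.execute("""
    CREATE TABLE tools (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL UNIQUE,
        category TEXT
    )
""")

# Add rows with parameter placeholders, then read some back.
conn.executemany(
    "INSERT INTO tools (name, category) VALUES (?, ?)",
    [("Shiny", "apps"), ("Django", "apps"), ("Spark", "big data")],
)
rows = conn.execute(
    "SELECT name FROM tools WHERE category = ? ORDER BY name", ("apps",)
).fetchall()

print(rows)
```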

R

Book: R for Data Science

Datacamp: subsetting data

Missing values

Dealing with Missing Values

Python Programming

Data Types

  • Strings: 'python', "python"

  • Numbers: 10, 10.5, 10+5j

  • Lists: ['python', 'website']

  • Tuples: ('python', 'website')

  • Dictionary: {'name': 'python', 'number': 1}

  • Sets: {1,2,3}

  • Boolean: True, False (these compare equal to 1 and 0)
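The types listed above can be checked in an interpreter. This is a small sketch; the variable names are arbitrary.

```python
text = "python"
whole, real, cplx = 10, 10.5, 10 + 5j
langs_list = ["python", "website"]       # mutable sequence
langs_tuple = ("python", "website")      # immutable sequence
info = {"name": "python", "number": 1}   # key-value mapping
unique = {1, 2, 3, 3}                    # duplicates are dropped
flag = True

values = (text, whole, real, cplx, langs_list, langs_tuple, info, unique, flag)
type_names = [type(v).__name__ for v in values]
print(type_names)

# Lists can be changed in place; tuples cannot.
langs_list.append("notes")

# bool is a subclass of int, so True compares equal to 1.
print(flag == 1, isinstance(flag, int))
```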

Function

Class

Data structure

Array Basic Sorting

Linked List

Recursion

Heap
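As a quick illustration, Python's standard library provides a binary min-heap through the `heapq` module, which works on a plain list. The numbers below are arbitrary.

```python
import heapq

nums = [5, 1, 8, 3, 2]

heap = list(nums)
heapq.heapify(heap)              # rearrange into heap order in place
heapq.heappush(heap, 0)          # insert while keeping heap order
smallest = heapq.heappop(heap)   # remove and return the minimum

top3 = heapq.nlargest(3, nums)   # three largest values, descending

print(smallest, heap[0], top3)
```

After the pop, `heap[0]` is always the smallest remaining element, which is what makes a heap a natural fit for priority queues.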

Queue and Stack
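A short sketch of the two access orders, using a plain list as the stack and `collections.deque` as the queue (one common choice in Python, not the only one):

```python
from collections import deque

# Stack: last in, first out.
stack = []
for item in ("a", "b", "c"):
    stack.append(item)
last_in = stack.pop()

# Queue: first in, first out.
queue = deque()
for item in ("a", "b", "c"):
    queue.append(item)
first_in = queue.popleft()

print(last_in, first_in)
```

`deque.popleft()` runs in constant time, whereas `list.pop(0)` has to shift every remaining element.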

Binary Tree

Advanced Tree

  1. complete tree

  2. segment tree

  3. trie (prefix tree)
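Of the trees listed above, the trie is the quickest to sketch. The class below is a minimal illustration with hypothetical method names (`insert`, `contains`, `starts_with`); each node stores its children in a dict keyed by character.

```python
class Trie:
    def __init__(self):
        self.children = {}    # character -> child Trie node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def _find(self, prefix):
        # Walk down the tree; return the node for prefix, or None.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._find(prefix) is not None


trie = Trie()
for w in ("spark", "sql", "shiny"):
    trie.insert(w)

print(trie.contains("sql"), trie.contains("sq"), trie.starts_with("sp"))
```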

DFS/BFS
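The difference between the two traversals shows up in the visiting order. This is a minimal sketch on a small made-up graph stored as an adjacency dict:

```python
from collections import deque

graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

def bfs(graph, start):
    # Breadth-first: visit nodes level by level using a queue.
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs(graph, start, seen=None):
    # Depth-first: follow one branch to the end before backtracking.
    if seen is None:
        seen = set()
    seen.add(start)
    order = [start]
    for nxt in graph[start]:
        if nxt not in seen:
            order.extend(dfs(graph, nxt, seen))
    return order

print(bfs(graph, "A"))
print(dfs(graph, "A"))
```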

Hash Table

ML packages

scikit-learn

Auto-sklearn: automatically searches for the right learning algorithm for a new machine learning dataset and optimizes its hyperparameters. Links: Website, GitHub folder.

Parallel Computing/HPC/Cloud Computing

Speed up Python

Optimizes Python for Intel architectures using low-level, high-performance libraries like MKL. Can provide massive speedup for linear algebra routines and ML algorithms.

Numba

Just-in-time compiler (using LLVM) for Python. Replaces slow Python code with optimized machine code at runtime. Super easy to use.
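A minimal sketch of what "easy to use" looks like in practice: decorating a loop-heavy function with Numba's `njit`. Numba is a third-party package, so this example falls back to plain Python when it is not installed; the fallback and the `sum_of_squares` function are illustrative assumptions, not part of Numba.

```python
try:
    from numba import njit  # third-party; compiles the function on first call
except ImportError:
    def njit(func):
        # Fallback so the example still runs: no compilation, no speedup.
        return func

@njit
def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

result = sum_of_squares(1000)
print(result)
```

With Numba installed, the first call includes compilation time, so any speedup only shows on later calls or on large inputs.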

swiftapply

Automatically vectorizes apply calls, or replaces them with the best alternative.

Dask

Provides parallelism for analytics by extending arrays, dataframes, and lists to “parallel” versions that are ready for distributed environments, plus provides a dynamic task scheduler.

Cython

Compile Python into C extensions. General use tool that can have more flexibility and power than simpler alternatives, at the cost of difficulty.

PySpark

Runs Python code on distributed Spark clusters. Great for processing big data sets.

Kyle McKiou Blog

Agile Development

Visualization

Visual Vocabulary

Seaborn

ggplot2

D3.js

Matplotlib

Tableau

QlikView

Git

Excel

Pivot Tables

Big Data Tools

5 Full Stack Data Science Technologies for 2020

AWS

Hadoop

MongoDB

Neo4j

Spark

  1. Spark ML

  2. Spark RDD

Flink Streaming

Hive

BigQuery

HBase

Cassandra

Business Intelligence Software

Power BI

KNIME

Alteryx

Qlik

OBIEE

Web analytics tools (Google Analytics, Adobe, etc.)

Web scraping

Beautiful Soup

urllib

Scrapy

Design

6 Google Slides image editing hacks

Canva: Online Design Tool

LaTeX

Mathematics in R Markdown

Markdown

Basic Syntax

Math
