Notes on Data Science Tools
Interactive applications
Shiny App
Mastering Shiny by Hadley Wickham
A Shiny app that uses ‘reactive’ data
UBC STAT545: All the shiny things
Django
Documentation: Getting started
Udemy: Python and Django Full Stack Web Developer Bootcamp
SQL
Guru99: Top 50 SQL Interview Questions & Answers
SQL databases and R (Data Carpentry)
Functions
- Join
- Subquery
- Windows
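The topics above (aggregate functions, joins, subqueries, window functions) can be tried out without installing a database server. The following is a minimal sketch using Python's built-in `sqlite3` module; the `customers` and `orders` tables and their rows are made up for illustration, and the window-function query assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

# In-memory database with two small, made-up tables (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 20.0), (3, 2, 30.0);
""")

# Function + join: total spent per customer; LEFT JOIN keeps customers with no orders.
totals = conn.execute("""
    SELECT c.name, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

# Subquery: orders whose amount is above the average order amount.
above_avg = conn.execute("""
    SELECT id, amount
    FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY id
""").fetchall()

# Window function: running total of order amounts within each customer.
running = conn.execute("""
    SELECT customer_id, amount,
           SUM(amount) OVER (PARTITION BY customer_id ORDER BY id) AS running_total
    FROM orders
    ORDER BY id
""").fetchall()

print(totals)
print(above_avg)
print(running)
```

The same SQL should carry over to most relational databases with little change, though dialects differ in details such as type names.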
Search Text with Regular Expressions
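Regular-expression support varies between SQL dialects. As one hedged illustration, SQLite parses the `REGEXP` operator but leaves its implementation to the application, so the sketch below registers a Python function for it through `sqlite3`. The `notes` table and its contents are hypothetical.

```python
import re
import sqlite3

def regexp(pattern, value):
    """Return 1 when `value` contains a match for `pattern`, else 0."""
    if value is None:
        return 0
    return int(re.search(pattern, value) is not None)

conn = sqlite3.connect(":memory:")
# SQLite evaluates "X REGEXP Y" as regexp(Y, X): the pattern arrives first.
conn.create_function("REGEXP", 2, regexp)

conn.execute("CREATE TABLE notes (body TEXT)")
conn.executemany(
    "INSERT INTO notes VALUES (?)",
    [("call 555-1234",), ("no number here",), ("fax 555-9876",)],
)

# Find rows that contain something shaped like a 7-digit phone number.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT body FROM notes WHERE body REGEXP ? ORDER BY rowid",
        (r"\d{3}-\d{4}",),
    )
]
print(matches)
```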
NoSQL databases
- Elasticsearch
- MongoDB
- CouchDB
- Cassandra
- HBase
Questions
- What are the two types of SQL?
- What’s a relational database?
- What’s a table?
- What’s the difference between structured data and unstructured data?
- How do you create a table?
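For the last question, a table is normally created with a `CREATE TABLE` statement. The following is a minimal sketch run through Python's built-in `sqlite3` module; the `employees` table and its columns are hypothetical, and type names differ somewhat between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Define the table: column names, types, and constraints (hypothetical schema).
conn.execute("""
    CREATE TABLE employees (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        salary REAL
    )
""")

# Insert one row and read it back to confirm the table exists and works.
conn.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Ada", 5000.0))

# PRAGMA table_info is SQLite-specific; the second field of each row is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(employees)")]
rows = conn.execute("SELECT id, name, salary FROM employees").fetchall()

print(columns)
print(rows)
```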
R
Missing values
Python Programming
Data Types
- Strings: 'python', "python"
- Numbers: 10, 10.5, 10+5j
- Lists: ['python', 'website']
- Tuples: ('python', 'website')
- Dictionary: {'name': 'python', 'number': 1}
- Sets: {1, 2, 3}
- Boolean: True, False (the integers 0 and 1 compare equal to False and True)
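The types listed above can be checked interactively. A short sketch follows; the variable names are arbitrary.

```python
text = "python"                          # str (single or double quotes both work)
whole, real, cplx = 10, 10.5, 10 + 5j    # int, float, complex
langs = ["python", "website"]            # list: ordered and mutable
pair = ("python", "website")             # tuple: ordered and immutable
info = {"name": "python", "number": 1}   # dict: key -> value mapping
unique = {1, 2, 3, 3}                    # set: duplicates are dropped
flag = True                              # bool

langs.append("docs")                     # lists can grow in place; tuples cannot

values = (text, whole, real, cplx, langs, pair, info, unique, flag)
type_names = [type(v).__name__ for v in values]
print(type_names)

# bool is a subclass of int, which is why 1 and 0 compare equal to True and False.
print(isinstance(flag, int), True == 1, False == 0)
```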
Function
Class
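As a quick reminder of the syntax for both, here is a small sketch with a function and a class that compute the same thing; `mean` and `RunningMean` are hypothetical names chosen for illustration.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


class RunningMean:
    """Accumulates values one at a time and reports their mean so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1
        return self  # returning self allows chained calls

    @property
    def value(self):
        # None signals "no data yet" instead of dividing by zero.
        return self.total / self.count if self.count else None


tracker = RunningMean()
for x in (2, 4, 6):
    tracker.add(x)

print(mean([2, 4, 6]), tracker.value)
```

The function suits data that is already in memory; the class keeps state between calls, which helps when values arrive one at a time.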
Data structure
Arrays and Basic Sorting
LinkedList
Recursion
Heap
Queue and Stack
Binary Search
Binary Tree
Advanced Tree
- complete tree
- segment tree
- trie tree
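Of the trees listed above, the trie (prefix tree) is probably the quickest to sketch. The version below is a minimal illustration, not a tuned implementation; the class and method names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True when a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def _find(self, prefix):
        """Return the node reached by following `prefix`, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._find(prefix) is not None


trie = Trie()
for word in ("spark", "sql", "sqlite"):
    trie.insert(word)

print(trie.contains("sql"), trie.contains("sq"), trie.starts_with("sq"))
```

Lookups take time proportional to the length of the key rather than the number of stored words, which is the main reason to choose a trie for prefix queries.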
DFS/BFS
HashTable
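Several of the topics in this list fit in a few lines of Python. The sketch below covers binary search, a heap via the standard `heapq` module, and BFS/DFS using a queue and a stack; the function names and the small example graph are assumptions made for illustration.

```python
import heapq
from collections import deque


def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


def bfs(graph, start):
    """Breadth-first traversal: a queue (FIFO) holds the frontier."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def dfs(graph, start):
    """Depth-first traversal: a stack (LIFO) holds the frontier."""
    seen, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbours reversed so they are visited in listed order.
        stack.extend(reversed(graph.get(node, [])))
    return order


# Adjacency-list graph stored in a dict (Python's built-in hash table).
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

# Min-heap: the smallest element is always popped first.
heap = [5, 1, 4]
heapq.heapify(heap)
heapq.heappush(heap, 2)
smallest = heapq.heappop(heap)

print(binary_search([1, 3, 5, 7, 9], 7), bfs(graph, "a"), dfs(graph, "a"), smallest)
```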
ML packages
Auto-sklearn: automatically searches for a suitable learning algorithm for a new machine learning dataset and optimizes its hyperparameters (website, GitHub folder)
Parallel Computing/HPC/Cloud Computing
Speed up Python
Intel Distribution for Python: optimizes Python for Intel architectures using low-level, high-performance libraries such as MKL. Can substantially speed up linear algebra routines and ML algorithms.
Numba: just-in-time compiler (using LLVM) for Python. Replaces slow Python code with optimized machine code at runtime, often by adding a single decorator to a function.
Swifter: automatically vectorizes pandas apply calls, or replaces them with the fastest available alternative.
Dask: provides parallelism for analytics by extending arrays, dataframes, and lists to “parallel” versions that are ready for distributed environments, plus a dynamic task scheduler.
Cython: compiles Python into C extensions. A general-purpose tool that offers more flexibility and power than simpler alternatives, at the cost of being harder to use.
PySpark: runs Python code on distributed Spark clusters. Well suited to processing large data sets.
Agile Development
Visualization
Seaborn
ggplot2
D3.js
Matplotlib
Tableau
QlikView
Git
Excel
Big Data Tools
5 Full Stack Data Science Technologies for 2020
AWS
Hadoop
MongoDB
Neo4j
Spark
- Spark ML
- Spark RDD
Flink Streaming
Hive
BigQuery
HBase
Cassandra
Business Intelligence Software
Power BI
KNIME
Alteryx
Qlik
OBIEE
Web analytics tools (Google Analytics, Adobe, etc.)
Web scraping
Beautiful Soup
urllib
Scrapy
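Beautiful Soup and Scrapy provide higher-level APIs for this, but the core idea of scraping (parse HTML, pull out the parts you need) can be sketched with the standard library alone. The example below uses `html.parser` and `urllib.parse` on a hard-coded HTML string so that it runs without network access; the `LinkCollector` class, the sample markup, and the example.com URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


# Stand-in for a downloaded page; a real scraper would fetch this first,
# for example with urllib.request.urlopen.
html = (
    '<p><a href="/docs">Docs</a> and '
    '<a href="https://example.org/x">X</a> <a>no href</a></p>'
)

collector = LinkCollector("https://example.com/start")
collector.feed(html)
print(collector.links)
```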
Design
6 Google Slides image editing hacks