
Lesson 5 ETL

Author: 可苟可远 | Published 2021-01-08 00:09

The importance of ETL

OLTP vs OLAP: frequent DML vs multi-dimensional data reads

Tables in OLTP DBs --ETL--> Data Warehouse --> Data Mart --> Decision making

dimension table vs fact table

Based on the relationship between fact tables and dimension tables, the common models are the star schema (denormalized/redundant data) and the snowflake schema.
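As a quick illustration, here is a minimal star-schema sketch in SQLite (all table and column names are made up): the fact table holds foreign keys into flat, denormalized dimension tables, whereas a snowflake schema would normalize the dimensions further into sub-dimension tables.

import sqlite3

# Star schema sketch (illustrative names): one fact table pointing at flat,
# denormalized dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, calendar_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);  -- category kept inline (redundant/denormalized)
CREATE TABLE fact_sales  (sale_id INTEGER PRIMARY KEY,
                          date_id INTEGER REFERENCES dim_date(date_id),
                          product_id INTEGER REFERENCES dim_product(product_id),
                          quantity INTEGER, amount REAL);
""")
# A snowflake schema would split category into its own dim_category table
# referenced from dim_product, trading redundancy for extra joins.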

Row Oriented DB

001:10,Smith,Joe,60000;

002:12,Jones,Mary,80000;

003:11,Johnson,Cathy,94000;

004:22,Jones,Bob,55000;

Column Oriented DB --> Aggregation/Compression

10:001,12:002,11:003,22:004;

Smith:001,Jones:002,Johnson:003,Jones:004;

Joe:001,Mary:002,Cathy:003,Bob:004;

60000:001,80000:002,94000:003,55000:004;
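A small Python sketch of the same four records laid out both ways (values taken from the example above); with the columnar layout, an aggregate such as average salary only has to scan one array, which is also what makes compression effective.

# Row-oriented: each record is kept together (good for OLTP-style point reads/writes).
rows = [
    ("001", 10, "Smith",   "Joe",   60000),
    ("002", 12, "Jones",   "Mary",  80000),
    ("003", 11, "Johnson", "Cathy", 94000),
    ("004", 22, "Jones",   "Bob",   55000),
]

# Column-oriented: each column is kept together (good for scans, aggregation, compression).
columns = {
    "rowid":     ["001", "002", "003", "004"],
    "empid":     [10, 12, 11, 22],
    "lastname":  ["Smith", "Jones", "Johnson", "Jones"],
    "firstname": ["Joe", "Mary", "Cathy", "Bob"],
    "salary":    [60000, 80000, 94000, 55000],
}

# An aggregate only needs to read the salary column, not every full row.
print(sum(columns["salary"]) / len(columns["salary"]))  # 72250.0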

Common column-oriented databases:

Redshift, BigQuery, Snowflake

ETL Solutions: ETL Tools vs. Hand-Written SQL

ETL Skills for the Data Scientist:

• Don't do everything in Python/R – if you are doing data manipulation, use a database tool.

• Run the extract in the memory of the source database.

• Limit the extraction to transactions newer than the start date/time of the last successful run, tracked in a variable (a minimal sketch of this pattern follows the list).

• Transfer to the destination database prior to joins.

• Execute joins/transformations in the destination database.

• Build any aggregations needed to speed up performance.

• Build appropriate checks into the process. Every ETL will eventually fail. By keeping the last-successful-run variable, the ETL can be set up to catch up automatically. A full rebuild may be necessary, and that is just a matter of setting the last good run's start date/time back to the date/time of the first transaction.

• Build the joined tables assuming you will re-use them. Aggregations and transformations may change over time, so do them in a separate step.

• Unless you are working with very large data sets, use as many intermediate tables in your destination database as you need. You'll get better performance and the extra storage space is no big deal.
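A minimal sketch of the incremental-extraction pattern above, using SQLite stand-ins for the source and destination databases; the etl_state and transactions table names (and the 'sales' job name) are purely illustrative.

import sqlite3
from datetime import datetime

# Hypothetical connections; in practice these point at the source OLTP database
# and the destination warehouse.
src = sqlite3.connect("source.db")
dst = sqlite3.connect("warehouse.db")

# 1. Read the start time of the last successful run (the catch-up variable).
dst.execute("CREATE TABLE IF NOT EXISTS etl_state (job TEXT PRIMARY KEY, last_good_run TEXT)")
row = dst.execute("SELECT last_good_run FROM etl_state WHERE job = 'sales'").fetchone()
last_good_run = row[0] if row else "1970-01-01 00:00:00"  # full rebuild = reset to the first transaction

run_started_at = datetime.utcnow().isoformat(sep=" ", timespec="seconds")

# 2. Extract only transactions newer than the last successful run, inside the source DB.
new_rows = src.execute(
    "SELECT id, customer_id, amount, created_at FROM transactions WHERE created_at > ?",
    (last_good_run,),
).fetchall()

# 3. Transfer to the destination before any joins; joins/transformations then run there.
dst.execute("CREATE TABLE IF NOT EXISTS stg_transactions (id, customer_id, amount, created_at)")
dst.executemany("INSERT INTO stg_transactions VALUES (?, ?, ?, ?)", new_rows)

# 4. Only after everything succeeds, advance the marker so the next run catches up automatically.
dst.execute(
    "INSERT INTO etl_state (job, last_good_run) VALUES ('sales', ?) "
    "ON CONFLICT(job) DO UPDATE SET last_good_run = excluded.last_good_run",
    (run_started_at,),
)
dst.commit()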

Common Tasks - SQL, Pandas
SQL

• Upper/lower convert

• Char/date/number convert

• Regexp

• NVL/NVL2/COALESCE

• Join

• Group by

• DML
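A compact sketch of several of these tasks using SQLite through Python (the emp/dept tables are made up): case conversion, char-to-number conversion, COALESCE as the NVL equivalent, a join, and a GROUP BY.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp  (empid INTEGER, lastname TEXT, salary TEXT, deptid INTEGER, bonus REAL);
CREATE TABLE dept (deptid INTEGER, deptname TEXT);
INSERT INTO emp  VALUES (10, 'Smith', '60000', 1, NULL), (12, 'Jones', '80000', 1, 500.0);
INSERT INTO dept VALUES (1, 'Engineering');
""")

rows = conn.execute("""
    SELECT UPPER(e.lastname)         AS lastname_upper,  -- upper/lower convert
           CAST(e.salary AS INTEGER) AS salary_num,      -- char/number convert
           COALESCE(e.bonus, 0)      AS bonus,           -- NVL/COALESCE null handling
           d.deptname
    FROM emp e JOIN dept d ON e.deptid = d.deptid        -- join
""").fetchall()

totals = conn.execute("""
    SELECT d.deptname, SUM(CAST(e.salary AS INTEGER)) AS total_salary
    FROM emp e JOIN dept d ON e.deptid = d.deptid
    GROUP BY d.deptname                                  -- group by
""").fetchall()

print(rows)
print(totals)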

Common Tasks - SQL, Pandas

Pandas

• Slice/iloc vs loc/apply

• Index/column convert

• Data type convert

• Format convert

• isnull/fillna/dropna/replace

• append/join/concat

• sort_values/groupby/pivot_table
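A small pandas sketch (made-up data) running through several of these: iloc vs loc, dtype conversion, fillna, concat, sort_values, groupby, and pivot_table.

import pandas as pd

df = pd.DataFrame({
    "dept":   ["eng", "eng", "sales", "sales"],
    "name":   ["Joe", "Mary", "Cathy", None],
    "salary": ["60000", "80000", "94000", "55000"],  # stored as strings on purpose
})

df["salary"] = df["salary"].astype(int)        # data type convert
df["name"] = df["name"].fillna("unknown")      # isnull/fillna/replace
first_two = df.iloc[:2]                        # position-based slice
eng_only  = df.loc[df["dept"] == "eng"]        # label/boolean-based selection

extra = pd.DataFrame({"dept": ["eng"], "name": ["Bob"], "salary": [70000]})
df = pd.concat([df, extra], ignore_index=True) # append/concat

by_dept = df.groupby("dept")["salary"].mean()                           # groupby
pivot   = df.pivot_table(index="dept", values="salary", aggfunc="sum")  # pivot_table
print(df.sort_values("salary", ascending=False))                        # sort_values
print(by_dept)
print(pivot)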

The six sklearn areas: classification, regression, clustering, dimensionality reduction, data preprocessing, feature extraction.

The commonly used sklearn modules are preprocessing, classification, regression, clustering, dimensionality reduction, and model selection.

Preprocessing: feature extraction and normalization

Common modules: preprocessing, feature_extraction

Typical applications: transforming input data (such as text) into data usable by machine learning algorithms.

Classification: identifying which category an object belongs to

Common algorithms: SVM (support vector machines), nearest neighbors, random forest

Typical applications: spam detection, image recognition.

Regression: predicting a continuous-valued attribute associated with an object

Common algorithms: SVR (support vector regression), ridge regression, Lasso

Typical applications: drug response, stock price prediction.

Clustering: automatically grouping similar objects

Common algorithms: k-means, spectral clustering, mean-shift

Typical applications: customer segmentation, grouping experiment outcomes.

Dimensionality reduction: reducing the number of random variables to consider

Common algorithms: PCA (principal component analysis), feature selection, non-negative matrix factorization

Typical applications: visualization, improved efficiency.

Model selection: comparing, validating, and choosing parameters and models

Common modules: grid search, cross validation, metrics

The goal is to improve accuracy through parameter tuning.
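A minimal end-to-end sketch touching three of these areas (preprocessing, classification, and model selection via grid search with cross-validation) on a built-in toy dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing (normalization) + classification (SVM) chained in one pipeline.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

# Model selection: grid search over hyperparameters with 5-fold cross-validation.
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.score(X_test, y_test))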


Appendix: algorithm selection path (the scikit-learn algorithm cheat-sheet diagram).

petl (200+ functions...)
>>> from petl import *

>>> table1 = fromcsv('example1.csv')

>>> table2 = convert(table1, 'foo', 'upper')

>>> table3 = convert(table2, 'bar', int)

>>> table4 = convert(table3, 'baz', float)

>>> table5 = addfield(table4, 'quux', expr('{bar} * {baz}'))

...no actual transformation work will be done until data are requested from table5 or any of the other row containers returned by the intermediate steps. So, in effect, a 5-step transformation pipeline has been set up.

>>> look(table5)

Homework:

Create a Python script on AWS EC2 to fetch data from Finnhub and load it into AWS RDS (a minimal sketch follows the list below):

• You can choose any data;

• You can use any of these libs: requests, finnhub-python, petl, pandas, etc.

• Create a shell script to run the above job automatically on a recurring schedule:

• You can use any job scheduling tool; crontab is preferred

• Implement a simple alert/notification system:

• Notify of the ETL job's finished status, either via email or SMS

• Send an alert during the ETL job if anything needs immediate attention
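A minimal sketch of the homework flow, assuming a Finnhub API token and an RDS connection string supplied via environment variables; the stock_quotes table name and the file paths are placeholders, and the real job would add the alerting and error handling described above.

import os

import pandas as pd
import requests                      # or the finnhub-python client
from sqlalchemy import create_engine

FINNHUB_TOKEN = os.environ["FINNHUB_TOKEN"]   # placeholder credential
RDS_URL = os.environ["RDS_URL"]               # e.g. postgresql://user:pass@host:5432/db

# Extract: fetch a quote from the Finnhub REST API.
resp = requests.get(
    "https://finnhub.io/api/v1/quote",
    params={"symbol": "AAPL", "token": FINNHUB_TOKEN},
    timeout=10,
)
resp.raise_for_status()

# Transform: shape the JSON payload into a one-row DataFrame.
df = pd.DataFrame([resp.json()])

# Load: append into an RDS table (table name is illustrative).
engine = create_engine(RDS_URL)
df.to_sql("stock_quotes", engine, if_exists="append", index=False)

# Scheduling: a crontab entry like the following would run the job every 5 minutes:
#   */5 * * * * /usr/bin/python3 /home/ec2-user/fetch_quotes.py >> /var/log/etl.log 2>&1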

Implement CI/CD with GitHub Actions

Tasks: 

1.7 local IDE; Finnhub registration; API usage and format; get first data

1.8 API data fetching done; connect to RDS from local; DB design and creation;

1.9 local done; GitHub pull and push from/to AWS; AWS deploy;

1.10 crontab; email alert;

1.11 SMS alert;

1.12 GitHub Actions

https://www.linkedin.com/pulse/fintech-blog-1-step-by-step-guide-analyzing-stock-data-zachary-blitz/

Setting Up VSCode For Python Programming

Once you dive into the deep sea of DS, playtime becomes a passer-by.
