A study list for data science

Data science, put simply: don't reach conclusions on gut feeling; base them on data and let the facts speak.

The skill set in three words: statistics, programming, communication


A PhD Data Scientist: Jack of all trades, master of one.

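In more detail: statistics (explore data, build models, design experiments),
programming (pull data, clean data, at least prototype your own data solution, and understand how basic big-data tooling works (MapReduce)),
communication (simplify the complex, present orally, write reports and papers, make plots (static and web))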


If your résumé (and your head) contains the following, you should have little trouble finding a job:
t-test, Regression, ANOVA, Logistic Regression, DOE, Machine Learning, Data Mining, MapReduce, SQL, R/Matlab, Python, Java

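=========================================
This post is mainly aimed at data science in the IT industry. It does not define a data engineer. Rather, it is close to a “full-stack data scientist”. Master this list and you will be able to work not only for established firms, but for startups too.
Roles in more traditional industries should demand somewhat more on communication and somewhat less on the rest.
Before an interview, be sure to spend a week learning the basics of the employer's industry; Wikipedia is enough. At minimum, become familiar with the industry's common keywords.
If your goal is just a decent job, settle down and work through the list.
If you want to do really well, stand out in at least one of the three areas.
Learning all of this takes a lot of time. If you hope to become a data scientist without much effort: OK, dream on!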

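Please don't treat bookmarking a study list as if it .Equals(study task completed). Let's actually start studying together!
[Strongly recommended: post your study plan so everyone can keep each other accountable and discuss it. The moderators will drop by with advice when they have time, and those who stick with it earn forum points.]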
=========================================
If anything is unclear, google it.

=========================================
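About a year ago the job market still looked very muddled. I browsed again today; probably because of year-end budgeting, many companies have posted many new positions. Here is my reading of what the postings ask for.
Data scientist/researcher positions now seem more targeted, and it is clearer what kind of person the employer actually needs: someone who knows a bit of everything, a programmer who knows some statistics, a machine learning person, an optimization/logistics/supply chain person, or a statistician who can program a bit.
(A data business person is generally not called a data scientist.) BI analysts who mainly use SQL to produce reports are not covered here either.
The study list is partly for interview prep and partly what you need day to day anyway. Items I have finished myself are marked as green.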
=========================================
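I plan to summarize what I have been studying here; additions are welcome. I will periodically roll them into the first post.
If you want to bookmark this thread, click "Favorite" below the first post -> Confirm -> the thread will then appear under "Quick Navigation" -> Favorites.
If you have nothing specific to add, there is no need to reply. Give points if you want to; it is fine if you don't.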

Please don't ask me how some school's Data Science program is, or whether your stats can get you into some school. I have no idea.
=========================================
Basically must-haves:

Statistics: statistics and machine learning
hypothesis testing, point/interval estimation
p-value, power, (type 1/2 error)
CLT, delta method, derive coef and var(coef), etc.
t-test: assumptions, remedies, and the range of problems it applies to. For the basics listed above, see this course: http://onlinestatbook.com/2/index.html
GLM (lm, logistic regression, ANOVA, etc.): assumptions, model selection and validation, diagnostics, remedies, and the range of problems they apply to
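
To make the t-test and p-value items concrete, here is a minimal sketch using only the Python standard library. It computes Welch's t statistic (the usual remedy when the equal-variance assumption is doubtful) and estimates a two-sided p-value by permutation rather than from the t distribution. The function names and the data are made up for illustration; they are not from any course on this list.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    var_a = statistics.variance(a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(b)
    se = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided p-value: share of label shuffles with |t| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t = welch_t(pooled[:len(a)], pooled[len(a):])
        if abs(t) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Invented illustrative data, not from any real study.
control = [5.1, 4.9, 5.6, 5.0, 5.3, 4.8, 5.2, 5.4]
treated = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0, 6.2, 5.6]

t_stat = welch_t(treated, control)
p_val = permutation_p_value(treated, control)
print(f"t = {t_stat:.2f}, permutation p = {p_val:.4f}")
```

The permutation version is chosen here only because it needs no distribution tables; in practice you would probably use a statistics package and check the assumptions first.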

time series: Forecast with R
Time Series Analysis and Its Applications: With R Examples (Springer Texts in Statistics)
and its UPitt course

Bayesian
Bayesian Methods for Hackers (Python)
Coursera Graphical Model (VERY nicely explained)
Bayesian Reasoning and Machine Learning (book, quite difficult to read)
Intro: A first course in Bayes; a very quick read and quite good

longitudinal, mixed model
DOE: all kinds of designs, response surface
(?) survival

Machine Learning: Coursera, Andrew Ng
Stanford Statistical Learning (Tibshirani & Hastie)
        The book also has an undergraduate edition that focuses on hands-on practice, with lots of R; very easy to read. I recommend starting from here.
I could not keep up with Caltech's Learning from Data.

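Statistical computing software: R/Matlab/Python. SAS(?)
Industry basically treats R and Matlab as equivalent. However, Matlab is not free, and Octave is free but not as pleasant to use. Consider teaching yourself R; if you already know Matlab, picking up R takes no time at all.
If you know no other language, only SAS Base/Stat, and you do not want to learn anything else, then data science may not be for you. If you must use SAS, you should at least have written macros. SAS is indeed very useful for modeling on large data, but it differs a lot from what other industries use; if everyone else on your team uses R/Py/Java, communicating with them will be extremely hard. The software is also expensive, and many places may not be willing to buy it.
Note that what I am saying is: knowing SAS is a good thing, but you cannot know only SAS.
Python: Python for Data Analysis (book), pandas
R: data.table or plyr, lubridate, reshape2; build an R package. There are now lots of such courses on both Udacity and Coursera. Start from any.
know how to get data from any source (DB, web, XML, plain text, etc.)
EDA (exploratory): descriptive stats, Udacity
Inference: Udacity
Plot/explain
read code from your favorite packages.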
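
As a rough illustration of "get data from any source", the sketch below reads the same kind of record from CSV, JSON, and XML text using only the Python standard library, then normalizes the types. The field names and sample strings are invented for the example; real sources would be files, URLs, or database cursors.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Invented sample payloads standing in for three different sources.
csv_text = "name,score\nalice,90\nbob,85\n"
json_text = '[{"name": "carol", "score": 70}]'
xml_text = "<rows><row name='dave' score='60'/></rows>"

records = []
records += [dict(r) for r in csv.DictReader(io.StringIO(csv_text))]
records += json.loads(json_text)
records += [dict(el.attrib) for el in ET.fromstring(xml_text).iter("row")]

# CSV and XML yield strings while JSON yields numbers, so normalize once.
for r in records:
    r["score"] = int(r["score"])

print(records)
```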

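--------------------------------------------------
Programming: a compiled language and a scripting language
Python
I prefer Udacity's approach of teaching while you do quizzes; exercises without lectures (Codecademy) don't seem to stick for me.
    Udacity CS101
    Udacity CS 215 (Algorithms; easier than the Coursera Princeton and Stanford courses, good for a quick pass)
    Udacity (Peter Norvig) CS212 Design of Computer Programs: excellent, strongly recommended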

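Java, data structures, and algorithms
1. Udacity Java (took me 40 hours to finish): suitable for people who don't even know what a function or an assignment is.
2. Data structures: I recommend treating this as required.       Python: Problem Solving with Algorithms and Data Structures
Java:  Berkeley 61B http://www.cs.berkeley.edu/~jrs/61b/
The textbooks are Head First Java and Data Structures and Algorithms in Java.
my progress bar: week 5, lab1, hw1.
3. Algorithms:                 Udacity Algo in Python is fairly laid back; take it if you don't want to work too hard, though it is better to be serious about this...
Java: Coursera Algorithms I & II (Princeton), if you are interested in the topic;
                  language-agnostic: Stanford Algorithms I & II is also very good. Neither can replace the other.
Very few people learn C# as their first language, so there really aren't any good beginner books for C#; not recommended. If you have never learned C, Java, or C++, going straight to a C# book is nearly incomprehensible.
C++ is harder and, for a data scientist, less widely applicable than Java. Of course, if you are an expert, please pretend I said nothing.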

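Based on interviewing candidates for my team and interviewing elsewhere myself, let me quantify: what level of programming does data science actually require?
Assuming you have all the other foundations above: unless the position specifically emphasizes being a statistician, or is titled Data Scientist, Statistics/Analytics, and the job description barely mentions code, you can assume some coding ability is required.
Specifically:
Data science at IT companies: you should be able to do LeetCode Medium problems. So, grind problems.
Traditional companies: I don't know.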

If you come from a software engineering background, or do work closer to data engineering, the bar will be higher.

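Topics include but are not limited to:
floating-point overflow
edge cases
improving MapReduce algorithms (beyond brute force)
with big data, the demands on time complexity are fairly high
(I will add more as I think of them)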
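
As one example of the floating-point overflow and edge-case items, here is a small sketch of the log-sum-exp trick in Python. This is my own choice of illustration, not a question taken from any particular interview; the function names are hypothetical.

```python
import math


def naive_log_sum_exp(values):
    """Direct formula; math.exp overflows once a value is large enough."""
    return math.log(sum(math.exp(v) for v in values))


def log_sum_exp(values):
    """Return log(sum(exp(v))) without overflowing for large v."""
    if not values:              # edge case: empty input
        return float("-inf")    # log(0), by convention
    m = max(values)
    if math.isinf(m):           # edge case: all -inf, or a +inf term
        return m
    # Shifting by the max keeps every exponent <= 0, so exp() cannot overflow.
    return m + math.log(sum(math.exp(v - m) for v in values))


big = [1000.0, 1000.0]
try:
    naive_log_sum_exp(big)
except OverflowError as err:
    print("naive version failed:", err)

print(log_sum_exp(big))
print(log_sum_exp([]))
```

The point is less the formula than the habit: ask what happens on empty input, on infinities, and on values far outside the range you tested.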

Small things to pick up along the way:
Regex (a couple of hours) http://deerchao.net/tutorials/regex/regex.htm
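
A couple of hours of regex is enough for tasks like the one sketched below: pulling structured fields out of text lines with Python's re module. The log format and field names are invented for illustration.

```python
import re

# Hypothetical log lines, invented for illustration.
lines = [
    "2013-11-20 10:15:02 user=alice action=login",
    "2013-11-20 10:17:45 user=bob action=purchase amount=19.99",
    "malformed line without fields",
]

# Named groups make the extracted fields self-documenting.
pattern = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"user=(?P<user>\w+) action=(?P<action>\w+)"
)

records = []
for line in lines:
    match = pattern.match(line)
    if match is None:  # skip lines that do not fit the expected shape
        continue
    records.append(match.groupdict())

print(records)
```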



SQL (a week) http://www.w3schools.com/sql/    Coursera: Intro to DB
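
If you want to practice SQL without installing a database server, a minimal sketch with Python's built-in sqlite3 module looks like this. The table and its rows are made up; the query shows the GROUP BY / HAVING / ORDER BY pattern that comes up constantly in practice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.0), ("alice", 30.0), ("carol", 12.5)],
)

# Per-customer order count and total, keeping only customers above a threshold.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 10
    ORDER BY total DESC
    """
).fetchall()
conn.close()

print(rows)
```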

Big data:
MapReduce: some knowledge    Udacity series:    http://blog.udacity.com/2013/11/sebastian-thrun-launching-our-data.html    Coursera: Intro to Data Science
    Coursera: Big data and web intelligence
    learning by doing: yes! I wrote my very first reducer for a real-life project!    MongoDB (Udacity).
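
For anyone who has not yet written a reducer, here is a toy single-process sketch of the MapReduce idea (map, shuffle by key, reduce) as a word count in plain Python. It only illustrates the programming model; it is not how Hadoop or any real framework is invoked, and the function names are my own.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)


lines = ["big data is big", "data science is fun"]
mapped = (pair for line in lines for pair in mapper(line))
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the mapper and reducer run on different machines and the shuffle moves data over the network, which is why cutting down what the mapper emits is usually the first improvement beyond brute force.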

If you want to be a DS for IT firms, then maybe:
jQuery/AJAX (start from Codecademy's very simple JS and jQuery intro, then find books)
--------------------------------------------------
web services   get a basic idea of how browsers work (Udacity: Website Performance Optimization).
Udacity Web Development (build a blog) (40 hours)
--------------------------------------------------
SE
   Software Development Life Cycles (Udacity, mostly videos, as a quick intro only). Amazingly, this one filled lots of holes in my knowledge base. Highly recommended.
A book is also mentioned there, worth a quick flip through; unfortunately, I found no ebook that works: Martin Fowler, Kent Beck, John Brant, William Opdyke, Don Roberts, Refactoring: Improving the Design of Existing Code

This is helpful not only for working in IT; it helps overall coding style and efficiency as well. I wish I'd known earlier.
--------------------------------------------------
Linux
Many servers run Linux. At least familiarize yourself with the command line. There's a not-so-good course on edX.
--------------------------------------------------
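Synthesis / analysis / communication / soft skills
    Soft skills are hard to put into words.
Technique is not what matters most; thinking things through before you open your mouth is the key. I suddenly noticed that my advisor's lab page opens with exactly these questions, and it resonated deeply.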

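The ability to simplify the complex and speak from the big picture: hide complex formulas and engineering details, and convey the big picture as much as possible.
    In my experience, the best way to acquire these skills is to go and talk. Don't just talk at people: keep checking whether the audience understands, encourage them to ask questions right away, and answer using keywords that fit their background rather than the keywords you happen to know well. Avoid abbreviations and niche jargon. Explain the intuition clearly and pile up fewer formulas.
    1. Teach an intro course in your own field. E.g., a statistics student can teach introductory statistics to people from other majors. Example: explain to someone who knows no statistics what a p-value, power, false positive, randomization, inference, etc. are.
    2. Consulting: some schools run sessions like this. Don't think of it as a waste of time. Go and make others understand; see what problems others tackle with your techniques, where their thinking differs from yours, how you can understand them, and how to make them understand you.
    3. Give presentations: don't present the way you would at a specialist academic conference; present the way you would teach a 101 class. The goal is not to show how complex and deep your field is, nor to impress others with your technical prowess, but to make the audience understand and ultimately take your advice.
    Data Journalism (course, starting early 2014): it was not as good as I expected. I do not recommend it.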

Plotting: for static plots, ideally know ggplot (a few hours); for dynamic ones, d3. If you know JavaScript, also great! Recommended reading:
     Nathan Yau: the books Visualize This and Data Points, and his FlowingData blog.
     for d3: Interactive Data Visualization for the Web. Free online tutorial by the author: http://alignedleft.com/tutorials/d3/about It really isn't that hard.
    Whether a plot looks pretty is not the key point; choosing the right chart to help explain the point matters more.
HTML (a few hours, w3c)
CSS (a few hours, w3c), or Codecademy, or the d3 book mentioned above
JavaScript (Codecademy as a start, a book to follow later)

rCharts/Highcharts
Udacity now also has a newly launched visualization course.

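Prototype your data products:
MEAN stack. https://thinkster.io/angulartutorial/mean-stack-tutorial/
At the very least learn AngularJS; it is useful beyond data science.
R OpenCPU. R Shiny (limited usage with the free version).
Although we are not trying to become front-end developers, it looks like we need at least half-baked front-end skills. Learn from this forum member's experience, it is excellent: http://www.1point3acres.com/bbs/thread-104335-1-1.html
Design:  (optional but nice to know) If you are not interested, at least look at (color combinations that look good together). If you are interested in making your plots look good, spend a weekend flipping through these books:
    1. Before and After
    2. The Non-Designer's Design Book
    3. Don't Make Me Think
    4. The Wall Street Journal Guide to Information Graphics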

Research/publication:
ShareLaTeX (invite enough users to get free versioning) / writelatex.com
Go to conferences, see what people are working on. Read their papers.
If you are looking for a certain type of job, find the team members on LinkedIn and skim their papers.
Domain knowledge: Google/Wikipedia is your friend
=========================================
Overall approach:
Doing Data Science (book)
Data Science for Business
=========================================
Other: small things that I feel don't take much time but are useful
Excel, Power Pivot, etc.
Popular-science books (all simple and easy to read):

What on earth is big data? http://www.amazon.com/Big-Data-Revolution-Transform-Think-ebook/dp/B009N08NKW/ref=sr_1_1?ie=UTF8&qid=1384931538&sr=8-1&keywords=big+data
and a very similar book: http://www.amazon.com/Automate-This-Algorithms-Markets-World-ebook/dp/B0064W5UAS/ref=sr_1_8?ie=UTF8&qid=1384931546&sr=8-8&keywords=algorithms
Just flip through them.
And then of course there is Nate Silver: http://www.amazon.com/The-Signal-Noise-Predictions-Fail-but-ebook/dp/B007V65R54/ref=pd_sim_kstore_1
=========================================
Case study:  Twitter data analytics http://tweettracker.fulton.asu.edu/tda/
=========================================
An MS data science study curriculum that someone recommended: http://datasciencemasters.org/
=========================================
Tools people recommended to me that help you organize your thinking and do things the right way. It's more important than you think!!
http://software-carpentry.org/lessons.html
Coursera Reproducible Research: learn to use knitr; do not copy-paste anything

Udacity Git course (the best, bar none).
