A Study List for Data Science

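Data science, put simply: don't draw conclusions from gut feeling; base them on data and let the facts speak.

The skill set in three words: statistics, programming, communication.


A PhD Data Scientist: Jack of all trades, master of one.

In more detail: statistics (explore data, build models, design experiments),
programming (pull data, clean data, at least prototype your own data solution, understand the basics of how big data works (MapReduce)),
communication (make the complex simple, present orally, write reports and papers, make charts (static and web))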


If you have these on your resume (and in your head), you should have basically no trouble finding a job:
t-test, Regression, ANOVA, Logistic Regression, DOE, Machine Learning, Data Mining, MapReduce, SQL, R/Matlab, Python, Java

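=========================================
This post is mainly aimed at doing data science in the IT industry. It does not describe a data engineer; rather, it comes close to a "full-stack data scientist". Master this list and you will be able to work not only for established firms but for startups too.
Roles that lean toward applications in traditional industries should demand somewhat more on communication and somewhat less on the rest.
Before an interview, be sure to spend a week learning the basics of the employer's industry; Wikipedia is enough. At minimum, get familiar with the industry's common keywords.
If your goal is simply a decent job, settle down and work through the list.
If you want to do really well, stand out in at least one of the three areas.
Learning all of this takes a lot of time. If you hope to become a data scientist without much effort, OK, dream on!

Please don't act as if bookmarking a study list .Equals(study task completed). Let's start learning together!
[Strongly recommended: post your study plan so everyone can keep each other accountable and discuss. The moderators will drop in with advice when they have time, and there are forum points for those who stick with it.]
=========================================
If anything is unclear, Google it.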

=========================================
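About a year ago the job market still looked like a jumble. I looked again today: probably because of year-end budget closing, many companies have posted many new positions. A breakdown of the job requirements is here.
It feels like data scientist/researcher positions are now more targeted, and you can see more clearly what kind of person the employer actually needs: someone who knows a bit of everything, a programmer who knows some statistics, machine learning, optimization/logistics/supply chain, or a statistician who knows some programming.
(A data business person is generally not called a data scientist.) BI analysts who mainly use SQL to generate reports are not covered here either.
The study list is partly for interview preparation and partly things you use day to day anyway. Items I have finished myself I mark as green.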
=========================================
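I plan to summarize some of what I have studied here; additions are welcome. I will periodically merge them into the first post.
If you want to bookmark this thread, click "Favorite" below the first post -> Confirm -> the thread will then appear under "Quick Navigation" -> Favorites.
If you have nothing specific to add, there is no need to reply. Give points if you want to; it's fine if you don't.

Please don't ask me how school X's Data Science program is, or whether your scores can get you into school X. I have no idea.
=========================================
Basically must-haves: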

Statistics: statistics and machine learning
hypothesis testing, point/interval estimation
p-value, power, (type 1/2 error)
CLT, delta method, derive coef and var(coef), etc.
t-test: assumptions, remedies, and the range of problems it applies to. For the basics listed above, see this course: http://onlinestatbook.com/2/index.html
GLM (lm, logistic regression, ANOVA, etc.): assumptions, model selection and validation, diagnostics, remedies, and the range of problems it applies to
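
To make the t-test item concrete, here is a minimal sketch of Welch's two-sample t statistic using only the Python standard library. The function name `welch_t` and the sample numbers are made up for illustration; in practice you would more likely call a statistics package, and you would still need to check the test's assumptions before trusting the result.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test.

    Illustrative helper, not taken from any library.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")

    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance uses the n - 1 (sample) denominator.
    var_term_a = statistics.variance(sample_a) / n_a
    var_term_b = statistics.variance(sample_b) / n_b
    if var_term_a + var_term_b == 0:
        raise ValueError("both samples are constant; t is undefined")

    t = (mean_a - mean_b) / math.sqrt(var_term_a + var_term_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (var_term_a + var_term_b) ** 2 / (
        var_term_a ** 2 / (n_a - 1) + var_term_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical measurements, invented for this example.
control = [5.1, 4.9, 5.0, 5.2, 4.8]
treated = [5.6, 5.4, 5.7, 5.5, 5.8]
t_stat, dof = welch_t(treated, control)
print(round(t_stat, 3), round(dof, 3))
```

Turning the statistic into a p-value needs the t distribution's CDF, which the standard library does not provide, so that step is left out here.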

Time series: Forecast with R
Time Series Analysis and Its Applications: With R Examples (Springer Texts in Statistics)
and its UPitt course

Bayesian
Bayesian Methods for Hackers (Python)
Coursera Graphical Model (VERY nicely explained)
Bayesian Reasoning and Machine Learning (book; quite difficult to read)
Intro: A first course in Bayes. A very quick read, and quite good.

longitudinal, mixed model
DOE: all kinds of designs, response surface
(?) survival

Machine Learning: Coursera, Andrew Ng
Stanford Statistical Learning (Tibshirani & Hastie)
        - The book also has an undergraduate edition that emphasizes hands-on practice, with lots of R; very easy to read. I recommend starting from here.
I could not keep up with Caltech's Learning from Data.

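Statistical Computing: R/Matlab/Python. SAS(?)
R and Matlab are basically regarded as equivalent in industry. However, Matlab is not free; Octave is free but not as pleasant to use. Consider teaching yourself R. If you already know Matlab, picking up R takes no time at all.
If you know no other language, only SAS Base/Stat, and you don't want to learn anything else, then data science may not be for you. If you absolutely must use SAS, you should at least have written macros. SAS is indeed very useful for modeling on large data, but it differs a lot from what other industries use; if everyone else on your team uses R/Py/Java, communicating with them will be exceptionally hard. The software is also expensive, and many places may not be willing to buy it.
To be clear: knowing SAS is a good thing, but you cannot know only SAS.
Python: Data Analysis with Python (book), pandas
R: data.table or plyr, lubridate, reshape2, build an R package. There are now lots of such courses on both Udacity and Coursera; start from any.
know how to get data from any source (DB, web, XML, plain text, etc.)
EDA (exploratory) – Descriptive stats, Udacity
Inference – Udacity
Plot/explain
read code from your favorite packages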

—————————————————–
Programming: a compiled language and a scripting language
Python
I prefer Udacity's teach-then-quiz approach; with exercises only and no lectures (Codecademy), I don't seem to learn things clearly.
    Udacity CS101
    Udacity CS 215 (Algorithms; easier than the Coursera Princeton and Stanford courses, good for a quick pass)
    Udacity (Peter Norvig) CS212 Design of Computer Programs: excellent, highly recommended

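Java, data structures and algorithms
1. Udacity Java (it took me 40 hours to finish this course): suitable for people who don't even know what a function or an assignment is.
2. Data structures: strongly recommended. Python: Problem Solving with Algorithms and Data Structures
Java: Berkeley 61B http://www.cs.berkeley.edu/~jrs/61b/
The textbooks are Head First Java and Data Structures and Algorithms in Java.
my progress bar: week 5, lab1, hw1.
3. Algorithms: Udacity Algo in Python is fairly laid back. If you don't want to work too hard you can take this course, but it's better to be serious about it...
Java: Coursera Algo I & II (Princeton), if you are interested in the topic;
language-agnostic: Stanford Algo I & II is also very good. The two cannot substitute for each other.
Very few people learn C# as their first language, so there really aren't any good introductory C# books; not recommended. If you haven't learned C, Java, or C++ before, going straight to a C# book is simply incomprehensible.
C++ is harder, and for a data scientist it is not as widely used as Java. Of course, if you are an expert, plz pretend I said nothing.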

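Based on my team interviewing others and my own interviews elsewhere, let me quantify it: what level of programming does data science actually require?
Assuming you have all the other foundations above: unless the position specifically emphasizes a statistician role, or is titled Data Scientist, Statistics/Analytics, and the job description barely mentions code, you can assume some coding ability is required.
The specific level:
Data science at IT companies: you should be able to do LeetCode Medium problems. So, go grind problems.
Traditional companies: I don't know.

If you come from a programming background, or the role leans more toward data engineering, the bar will be higher.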

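Topics involved include, but are not limited to:
floating-point overflow
edge-case handling
improving MapReduce algorithms (beyond brute force)
with big data, the requirements on time complexity are relatively high
- I'll add more as I think of it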
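
As one illustration of the first two items (floating-point overflow and edge cases), here is a small sketch of the usual log-sum-exp trick in Python. The function names are my own, and this is only one example of the kind of issue an interviewer might probe, not a list of what they will ask.

```python
import math


def naive_log_sum_exp(values):
    """Direct formula; math.exp overflows for inputs above roughly 709."""
    return math.log(sum(math.exp(v) for v in values))


def log_sum_exp(values):
    """Compute log(sum(exp(v))) by factoring out the largest value first."""
    if not values:
        # Edge case: an empty sum is 0, and log(0) is -infinity.
        return float("-inf")
    peak = max(values)
    if math.isinf(peak):
        # Edge case: all inputs are -inf, or a +inf is present.
        return peak
    # exp(v - peak) is at most 1, so it cannot overflow.
    return peak + math.log(sum(math.exp(v - peak) for v in values))


big = [1000.0, 1000.0]
print(log_sum_exp(big))  # 1000 + log(2)
try:
    naive_log_sum_exp(big)
except OverflowError as err:
    print("naive version failed:", err)
```

The two functions agree on small inputs; only the second survives large ones, which is the sort of boundary behavior worth checking by hand.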

Small odds and ends to pick up along the way:
Regex (a couple of hours) http://deerchao.net/tutorials/regex/regex.htm
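
For a feel of what a couple of hours of regex buys you, here is a minimal Python sketch that pulls ISO-style dates out of a log-like string. The pattern, the function name `extract_dates`, and the sample line are all invented for illustration.

```python
import re

# Four digits, two digits, two digits, separated by hyphens.
# \b keeps the pattern from matching inside a longer run of digits.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_dates(text):
    """Return every YYYY-MM-DD match in text as a (year, month, day) tuple."""
    return [
        tuple(int(part) for part in match.groups())
        for match in DATE_PATTERN.finditer(text)
    ]


line = "user=42 signup=2013-11-20 last_login=2014-01-05"
print(extract_dates(line))  # [(2013, 11, 20), (2014, 1, 5)]
```

Note that the pattern only checks the shape of the string; it would happily accept 2014-13-45, so real date validation needs a further step.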



SQL (a week) http://www.w3schools.com/sql/    Coursera: Intro to DB
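
A week of SQL mostly means getting comfortable with SELECT, WHERE, JOIN, and GROUP BY. The sketch below runs one GROUP BY query against an in-memory SQLite database from Python; the `orders` table, its columns, and the rows are hypothetical and exist only for this example.

```python
import sqlite3


def revenue_by_region(rows):
    """Load (region, amount) rows into a temporary table and aggregate them."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        # Parameterized inserts avoid building SQL strings by hand.
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, COUNT(*), SUM(amount) "
            "FROM orders GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


orders = [("east", 10.0), ("west", 5.0), ("east", 2.5)]
print(revenue_by_region(orders))  # [('east', 2, 12.5), ('west', 1, 5.0)]
```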

Big data:
MapReduce: some knowledge
    Udacity series: http://blog.udacity.com/2013/11/sebastian-thrun-launching-our-data.html
    Coursera: Intro to Data Science
    Coursera: Big data and web intelligence
    Learning by doing: yes! I wrote my very first reducer for a real-life project!
    MongoDB (Udacity)
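
To show the shape of the MapReduce idea, here is a toy word count in plain Python. It is a single-process sketch with an in-memory shuffle step, not Hadoop or any real framework, and the function names are my own.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big deal"]))
```

In a real cluster the mapper and reducer run on different machines and the shuffle moves data over the network, which is why reducing the number of emitted pairs is usually the first optimization people look at.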

If you want to be a DS for IT firms, then maybe:
jQuery/Ajax (start from Codecademy's very simple JS and jQuery intro, then find books)
—————————————————–
web services: get a basic idea of how browsers work (Udacity – Website Performance Optimization)
udacity web development (build a blog) (40 hours)
—————————————————–
SE
   Software Development Life Cycles (Udacity, mostly videos, as a quick intro only). Amazingly, this one filled lots of holes in my knowledge base. Highly recommended.
Also, a book is mentioned there that is worth a quick flip through; unfortunately, I found no ebook that works: Martin Fowler, Kent Beck, John Brant, William Opdyke, Don Roberts, Refactoring: Improving the Design of Existing Code

- This is helpful not only for working in IT; it helps overall coding style and efficiency as well. I wish I'd known earlier.
—————————————————–
Linux
Many servers run Linux. At least familiarize yourself with the command line. There's a not-so-good course on edX.
—————————————————–
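Synthesis / analysis / communication / soft skills
    Soft skills are hard to put into words.
Technique is not the most important thing; thinking it through before you speak is the key. I suddenly noticed that my advisor's lab page actually opens with these very questions, and it really struck a chord.

The ability to simplify and communicate from a high vantage point: hide complex formulas/engineering details and convey the big picture as much as possible.
    In my experience, the best way to acquire these skills is to go and speak. Don't just talk to yourself: keep checking whether the audience understands, encourage them to ask questions right away, and answer with keywords that fit their background rather than keywords "you are familiar with". Avoid abbreviations and niche jargon. Explain the intuition clearly and pile on fewer formulas.
    1. Teach an intro course in your own field. E.g., a statistics student can teach intro statistics to people from other majors. Example: explain to someone who knows no statistics what p-value, power, false positive, randomization, inference, etc. mean.
    2. Consulting: some schools have sessions like this. Don't think it's a waste of time. Go make others understand, see what problems they tackle with your field's techniques, where their thinking differs from yours, how you can understand them, and how you can get them to understand you.
    3. Give presentations: don't present the way you would at a specialist academic conference; present as if teaching a 101 class. The goal is not to show how complex and profound your field is, nor to impress others with your technical prowess, but to make the audience understand and ultimately take your advice.
    Data Journalism (course, starting early 2014): it was not as good as I expected. I do not recommend it.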

Charts: for static ones, ideally know ggplot (a few hours); for dynamic ones, d3. If you know JavaScript, also great! Recommended reading:
     Nathan Yau: the books Visualize This and Data Points, and his FlowingData blog.
     for d3: Interactive Data Visualization for the Web. Free online tutorial by the author: http://alignedleft.com/tutorials/d3/about It's really not that hard.
    Whether a chart looks pretty is not the point; choosing the right chart to help explain the idea is what matters.
HTML (a few hours, w3c)
CSS (a few hours, w3c), or Codecademy, or the d3 book mentioned above
JavaScript (Codecademy as a start, a book to follow later)

rCharts/Highcharts
Udacity now also has a newly launched vis course.

Prototype your data products:
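MEAN stack. https://thinkster.io/angulartutorial/mean-stack-tutorial/
At the very least learn AngularJS; it's useful beyond data science.
R OpenCPU. R Shiny (limited usage with free version).
We're not aiming to do front-end development, but it looks like we need to be at least a half-baked front-end developer. Learn from this forum member's experience, it's excellent: http://www.1point3acres.com/bbs/thread-104335-1-1.html
Design: (optional but nice to know) If you're not interested, at least look at (color combinations that look good together). If you are interested in making charts look good, spend a weekend flipping through these books:
    1. Before and After
    2. The Non-Designer's Design Book
    3. Don't Make Me Think
    4. The Wall Street Journal Guide to Information Graphics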

Research/publication:
ShareLaTeX (invite enough users to get free versioning) / writelatex.com
Go to conferences, see what people are working on. Read their papers.
If you want a certain type of job, find the team members on LinkedIn and skim their papers.
Domain Knowledge: Google/Wikipedia is your friend
=========================================
Overall approach:
Doing Data Science (book)
Data Science in Business
=========================================
Other: small things that I feel don't take much time but will be useful
Excel, Power Pivot, etc.
Popular-science books (all simple and easy to read):

What on earth is big data??? http://www.amazon.com/Big-Data-Revolution-Transform-Think-ebook/dp/B009N08NKW/ref=sr_1_1?ie=UTF8&qid=1384931538&sr=8-1&keywords=big+data
And a very similar one: http://www.amazon.com/Automate-This-Algorithms-Markets-World-ebook/dp/B0064W5UAS/ref=sr_1_8?ie=UTF8&qid=1384931546&sr=8-8&keywords=algorithms
Just skim them.
And then of course there's Nate Silver: http://www.amazon.com/The-Signal-Noise-Predictions-Fail-but-ebook/dp/B007V65R54/ref=pd_sim_kstore_1
=========================================
Case study: Twitter data analytics http://tweettracker.fulton.asu.edu/tda/
=========================================
A recommended MS data science study curriculum: http://datasciencemasters.org/
=========================================
Tools people recommended to me that help organize your thinking and do things the right way. It's more important than you think!!
http://software-carpentry.org/lessons.html
Coursera Reproducible Research: learn to use knitr, and don't copy-paste anything

Udacity Git course (the best, bar none).


II. How Should I Live in This World: From Anxiety to Peacefulness

I've now been studying at UT and living in Austin for a whole year. I cannot help reminiscing about the beginning of the spring semester of 2014, when everything happening to me on the other side of the world seemed new, unfamiliar, and sometimes hard to handle.

I was looking down from the window of a small interstate airplane. The land of Texas is mostly dark yellow and dark green. It was January; nevertheless, I still seemed to feel heat waves hanging over this vastness. "Is this really the city, the soil, I will live on for the next year and a half before graduating from college? Could I possibly like it?" I murmured to myself. Texas was too sparse, big, and "barren" for a student who had spent fairly long stretches living and studying in places like DC and California. "Anyway, this is my choice, and it was the optimal choice at that moment."

I was anxious about life in a new environment. Life showed me its complex and varied sides, no longer limited to grades but including issues like which apartment to pick for next semester, what to buy to cook for meals, how to protect myself when walking home alone from the library at midnight, and how to make friends whose outlooks differ implicitly from mine and whose languages differ explicitly...... Part of me was so uneasy about all the miscellaneous things popping up every day that I tackled them cautiously, one by one: I moved twice in two months, from the west of Austin to the east and finally to an apartment near school...... I was on my own at the young age of 19. I like this sentence: "the quickest way of learning new skills is by expanding your comfort zone."

In the first semester, I met kind and friendly peers and seniors who greatly helped me transition smoothly to regular life. I encountered a great though disputed existence in this world, through whose love I have received selfless love and care from strangers, and I have been trying to give the same care to others. I got a 4.0 GPA in the first semester and an offer from an NGO to campaign for water conservation nationwide. I got an offer to study abroad in Botswana for a month during the summer through the Department of Geography. I got to know a well-known, very kind, and approachable professor of geophysics at UT, Sergey Fomel, and attended his software workshop at Rice University. During the summer, I worked on a research team from Columbia University's Department of Political Science, surveyed the Spanish people living in Austin, and used the data to test several hypotheses. However, what I did not do very well in was the three summer courses: US Government, Texas History, and US History. The possible reasons are: 1. It was summer, and I was working part time while studying. What I should strengthen is the skill of not only handling multiple tasks but, more importantly, completing each one as well as expected. 2. I thought the courses were not relevant to my interests; they were taken not at UT but at Austin Community College, and after a long semester I felt tired. What I should improve is, first, my sense of responsibility, to myself and to my parents' money, and second, my respect for knowledge: no matter where it is imparted to me, I should treat it equally. 3. I had no close friends around, or my friends were travelling to places I wanted to visit as well.
What I should avoid is the negative emotion provoked by others' happiness in doing things I desire to do. If I long to travel, I should make plans and learn to travel safely on my own. I should also learn to have fun even when I'm alone in a new environment, to drive out the unhappiness and isolation: I should use the internet to check out events happening or about to happen in town, and so on.

During the second semester at UT, I took five classes, four in math and one in CS, while auditing Professor Sergey's class. I had my first research project, small as it was, with my mentor Sona through the Directed Reading Program in the math department, and gave a 15-minute presentation in front of my peers and professors. I got an interview with Dell for a Software Engineer position. I got an interview for a tutor position at the Sanger Learning Center. I got an interview for an Outreach Assistant position at the Sanger Learning Center. I reached the second round for membership in Texas Undergraduate Computational Finance. I got As in four of my classes and a B+ in Probability. I applied for the Math Honors Program. I registered for the Austin Half Marathon this February.

What I have not done perfectly is as follows:

1. I made several attempts to read Professor Sergey's papers, hoping to start undergraduate research with him, but I made no progress on my independent research.

2. I audited the class but was not able to take its final.

3. I failed all the interviews I got, and didn't get up in the morning to go to the Outreach Assistant interview.

4. I was kind of slacking off during the first half of the probability class, so even though I ranked pretty high on the cumulative final exam, I was not able to pull my grade all the way up with one exam.

5. I did not run or work out as regularly as I had expected.

How to improve:

1. Since I'm now pretty sure that my interest is geophysics, I should focus most of my time and energy on it instead of spreading across too wide a field, from software development to sociology. The more you put in, the more you get out.

2. I should let go of myself and not be too self-conscious: don't pay so much attention to the result that I become afraid of failure.

3. I should talk to Sergey more about what I'm concerned about, what I did not understand, how to work for him, and his standards for PhDs. I should cherish this great resource. I may go to Pickle once every two weeks to talk with him.

4. Refine my resume and my interview skills. Get as many interviews as I can through websites like Indeed and LinkedIn. Remember, I'm young, and I've got nothing to lose.

5. Learn to adapt quickly to classes of different styles and forms. Don't make excuses for laziness. Keep up the good work from beginning to end.

6. Make regular workouts part of my life.

However unsatisfied I am with my own life, I still thank life for bringing different interesting and respectable people into it. Jun Zheng is a tough guy who came from a farming background and has been truly on his own ever since university; he is now a 2L student at UT Law School studying intellectual property law. Heidi Zhang, now applying to a UT grad program, graduated from the same high school as I did and has had an energy-loaded, adventure-fueled life on the wicked trails among mountains and woods. Lyra Hao is now a PhD student in geophysics at Stanford; she is a free spirit who has traversed most of the world and captured countless wonderful moments with her professional-level photography skills. Paul and Judith are an old but young-at-heart couple who taught me how to face the twists and turns of life and who instill peacefulness and composure in me. (CONTINUED)

Now I'm in China with my parents. I am not at peace here. I can only regain that peacefulness, short-lived as it might be, back at UT, where some part of me, be it the heart or the soul, is on the way to the truth and freedom of life, as time never stops for my regret and idleness.

CONTINUED……

I. How Should I Live in This World: From Shanghai to Austin, US

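A year ago today, I was in Shanghai celebrating the New Year with my companions from the Overseas Chinese Foundation, jojo and veronique. Dots of neon light adorned the Bund. On the top floor of the Guofeng Building in Hongkou District, in the foundation's conference hall, I ate snacks and sang karaoke, singing away the exhaustion, the rushing about, and the anxiety of the past four months in Shanghai, in Hanoi, in Washington, DC, in Warsaw, and in Shanghai and Chengdu, while also feeling grateful for the growth of mind and the more complete sense of values those four months had given me. I looked forward to flying, a week later, to another city: Austin, Texas.

From early July to mid-August 2013, I was at Georgetown University in Washington, DC. Through interviews and exams offered by my former university, Shanghai International Studies University, I won a scholarship to study at GU on exchange for a month. In just over a month I had to adapt quickly to a new language environment, and the two liberal arts courses, Foundations of the US Constitution and Introduction to Philosophy, were not easy for a non-native English speaker like me. I had long been interested in British Parliamentary debate, and I chose these two courses privately thinking they would lift both my debating technique and my content. But once I actually studied them properly, I did not seem to enjoy it much. The American culture and daily life the teachers touched on in class, such as major American events, or the names of people and places then well known across America that they joked about, always left me completely lost, even sitting in the front row with my ears pricked up. Even though I could not fully understand every sentence the teachers said, my legal writing still earned grades of A- to A. This atmosphere of being challenged and uncomfortable excited and encouraged me. I thought back to myself at university in Shanghai, always among the best in the class, winning the top scholarship every semester. I also thought about my major, English interpreting and translation: did I really want to do this kind of work in the future?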

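On August 10, I returned home to Chengdu. I knew the idea of transferring had long been stirring in my heart, so why waver? If you think of it, go and do it; "too late" does not exist, and is only an excuse for avoiding yourself and fearing failure. At that point I had twenty days to prepare for the TOEFL; no, actually fifteen, because for five days I would be an assistant coach at a British Parliamentary debate training camp. Time was tight, but I told myself I could certainly get the score I wanted.

On September 1, I returned to Shanghai to begin the fall 2013 semester. This semester meant a great deal to me: in order to transfer, I had to maintain a high GPA while setting aside enough time to prepare for and take the SAT. Beyond these standardized tests, I needed to meet different people and go through different experiences, and to write essays that showed my thinking and personality. Perhaps because I was picturing my studies and life after transferring, I was full of strength and energy: I was even preparing to go to Warsaw, Poland, a week after the SAT to attend the 19th Conference of the Parties to the United Nations Framework Convention on Climate Change.

Starting in September, I got up at 6:30 every morning. Apart from regular classes, the library was my second dorm room. Although the SAT is a test for American high school students, it was no piece of cake for me. I sought out all kinds of SAT techniques and resources, made a plan every day, and did my best to complete it. I remember clearly that only after the ten-day National Day holiday (the campus was empty, I went nowhere, the library was closed, and every day I went to a classroom to study alongside the older students preparing for the graduate entrance exam) had I truly gotten past the vocabulary hurdle and started doing practice problems. By then it was exactly one month until my exam in Hanoi, Vietnam, and a month and two weeks until the forum in Warsaw. I had no one to confide in, and every day I was like a madwoman oblivious to the world. I remember once, utterly miserable, calling my parents from the lobby of the library building and crying out loud, not caring at all about the students reading around me. I said I didn't like the state I was in, the atmosphere I was in, or my major. I wanted to transfer, to study what I wanted to study and live the life I wanted. My dad said: handle it yourself. If you manage to leave, we will pay for it; if you don't, clean up the mess yourself. The world seemed to be against me, and I had no way back. (Because I had joined the Party in high school, I was assigned duties in a Party constitution study group, but stretched too thin, I fulfilled almost none of them; my Party membership hung by a thread, and my parents piled on the pressure...)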

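I didn't know much about choosing schools, but I knew my general direction: if the Ivies didn't work out (in hindsight, how one-sided I was), I would go to a school strong in science and engineering. One afternoon, in Professor Gu Yue's office, I mentioned that I was preparing to transfer, and he said offhand: try UT Austin; I was a visiting scholar there, and it's a school that would suit you well. I went back, looked it up, and casually applied for its spring 2014 transfer admission, thinking of it as a safety school at least.

On November 1, I flew from Shanghai to Hanoi and completed the five-hour SAT at the Crown Hotel. On November 14, which I remember as a Friday, I finished two morning classes and rushed to Pudong Airport to fly to Warsaw. I won't go into all the drama and uncertainty along the way, or missing my flight at Chopin Airport.

Everything proceeded in an orderly way. I no longer felt tired, only boundless hope and unwavering resolve.

At the end of November, I received my admission letter from UT Austin. I was admitted to the sociology and mathematics majors, exactly as I had hoped; I wanted to become a well-rounded person during my undergraduate years.

So this saying is true: wherever you want to go, the whole world will make way for you.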