My first exposure to statistical analysis was from the marvelous book Chaos Theory Tamed by Garnett P. Williams. This book explores chaotic systems, systems that are organized by clearly defined rules yet have seemingly-random behavior and are highly sensitive to initial conditions. In this book he explored many of the fundamental techniques I apply to analyzing time series models such as stationarity checks, analysis for seasonality, periodicity, and trend.
The idea behind chaos theory is looking for trends in information, finding clues that indicate there is a pattern behind the data rather than just random noise. And this technique applies to many other branches of statistical analysis. One of the main goals in modelling is determining if there is some predictability in variables or if they have no effect on one another. Finding these correlations is vital to developing proper models.
Chaotic behavior is often found in places like fluid dynamics and is hypothesized to exist in systems as complex as the stock market. It is an incredibly interesting phenomenon that demonstrates many of the interesting features of statistical modelling.
Monday, April 7, 2014
Universally Selected Hyper Logically Developed Quantum Information Theory (U SHLD QuIT)
A common problem that I have mentioned many times before is the problem of big data. There is an enormous amount of information in the world. We have learned how to harness inputs from a myriad of different fields and the result is more data than can be feasibly handled using classical techniques.
This naturally gives rise to the question, what are some non-classical techniques? And many of these have been discussed before such as a different way of isolating trends or a different method of modelling. These ideas are based on the fact that we have only so much processing power and increasingly large amounts of data. But there is another option. What if we limit our processing technique, but in exchange give it nearly unlimited power? In other terms, running our programs won't tell us the same things, but they will run orders of magnitude faster! What is this amazing technology you ask? Well, welcome to quantum computing.
Quantum computing has suffered a large share of poorly researched journalism over the years, but after focusing my summer research project at Stanford on quantum information theory I feel prepared enough to banish the illusions.
The basis of quantum computing is that the concept of a bit, a "light bulb" that is either on or off, can be slightly changed. In classical computing this idea of on or off, all or nothing, is how we store data. Through long strings of on and off light bulbs (or 1's and 0's as they are often known) we can express all manner of ideas. Quantum computing uses physical properties of the universe to make things a little bit more interesting. Instead of a bit being on or off, it has some probability of being on and some probability of being off. Basically that means that we don't know if it is a 1 or a 0; if we look closely enough we can find out, but without looking closely all we know are these probabilities. (And while this explanation still skates around some major concerns, it is accurate enough for this blog post.)
But Ryan, what does this have to do with analyzing data? I'm glad you asked! It turns out that since this bit can have a whole continuous spectrum of probabilities of being on or off, it can store a lot more data in it. This means that we can take our large amounts of data, translate them into these "quantum bits" and use them for our purposes. But this comes with a great drawback. Information in a "qubit" is not as accessible as in a regular bit. When we "read" a qubit, it resolves into either a 1 or a 0, and any other information is lost. However, there are certain mathematical techniques that we can use to solve some problems faster than we could using classical bits. And thus comes the hope that someday we can use these techniques to analyze large amounts of data in a quick fashion.
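To make the "reading loses information" point concrete, here is a minimal Python sketch of a toy qubit. It is a classical simulation under simplifying assumptions, not real quantum hardware or a real quantum library: the class name Qubit, its methods, and the use of real-valued amplitudes are all made up for illustration.

```python
import math
import random


class Qubit:
    """Toy qubit a|0> + b|1> with real amplitudes (an illustrative simplification)."""

    def __init__(self, prob_one):
        if not 0.0 <= prob_one <= 1.0:
            raise ValueError("prob_one must be between 0 and 1")
        # Amplitudes are square roots of the probabilities.
        self.amp_zero = math.sqrt(1.0 - prob_one)
        self.amp_one = math.sqrt(prob_one)

    def probabilities(self):
        """Return (probability of reading 0, probability of reading 1)."""
        return self.amp_zero ** 2, self.amp_one ** 2

    def measure(self, rng):
        """Read the qubit: returns 0 or 1 and collapses the state to that value."""
        outcome = 1 if rng.random() < self.amp_one ** 2 else 0
        # After the read, the original probabilities are gone for good.
        if outcome == 1:
            self.amp_zero, self.amp_one = 0.0, 1.0
        else:
            self.amp_zero, self.amp_one = 1.0, 0.0
        return outcome


rng = random.Random(0)  # seeded so the run is repeatable
qubit = Qubit(0.3)      # 30% chance of reading a 1
first = qubit.measure(rng)
second = qubit.measure(rng)
print(first, second)    # the two reads always agree: the state has collapsed
```

The starting value 0.3 could have been any number between 0 and 1, yet after a single read all that survives is one ordinary bit.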
SUMaC and Statistics
Mathematics and modelling go hand in hand. Much of mathematics is simply development of models that fit some subset of the universe or categorize some phenomena. So when it comes to statistical modelling a good deal of math is often involved. Yet this is a problem, because as useful as this modelling is, it suffers an enormous scarcity issue due to a challenging problem.
Thankfully I avoided this issue by enrolling in Mrs. Bailey's Category Theory class. There I learned an appreciation of math that I had lacked before and it led me on my path toward the Stanford University Mathematics Camp. And that was where I learned the true meaning of being a mathematician. It is more than creating formulas and equations. These things are often done, but it comes down to more than that. Mathematics is about solving problems in a logical manner. And these techniques are the cornerstone of succeeding in the modern business world.
Friday, April 4, 2014
A Brief Summary of Ryan Smith
One of my frequent human interactions in the last few weeks has been in the Stanford Facebook group. It is composed of the admitted students, and after regular decisions came out a week ago the group has been inundated with new people. A common trend is for people to introduce themselves, talking about things they like and similar topics. I eventually decided to take a stab at it and here is my introduction.
I want you all to know that you are influencing me with peer pressure and that is wrong and you should feel terrible. Well, with that out of the way...
Hello everyone, I'm Ryan Smith and if you know a way to put me into a stasis until September please let me know! I am the youngest of 5, an act-a-holic, frequent video game connoisseur, and math enthusiast. That's in fact how I came to apply to Stanford. A few of my friends had attended the Stanford Mathematics Camp (SUMaC represent!) and this led me to apply in 2013 and I proceeded to have one of the best summers of my life. I met lots of amazing people, many of whom are in this group, and fell in love with the campus, the lifestyle, and the community. From that point on I couldn't see myself going anywhere else and I was and am very relieved that I found out that I was going in December.
On other topics, I've made a second home at my local community theater and have been a part of over 40 performances in the last 4 years and if there is anything that I am going to miss it will be my lovely Fountain Hills Theater.
The other major influences in my life have been Warcraft III, my first major online game, WoW and LoL as a place where I found many of my closest friends, and my obnoxious older sister who has shaped my mind to her own purposes.
I'm going to major in Mathematics and Computer Science and love learning about all of the amazing technologies we use. Anyway, that's me, hi. How are you?
Abstract
Today I am showing my abstract for my SRP presentation. This is the first step toward my actual presentation which I will present in May. Without further ado here is my abstract.
As the world of data analytics becomes increasingly vital to
the business world, many corporations are utilizing it to streamline their
marketing, sales, and development departments. This research project explores
the data manipulation techniques and tools used by software giants like Google,
Facebook, Amazon, and Netflix to market their product and improve their
services. These companies utilize petabytes of information that ranges from
data on their clients to marketing trends of certain products and this
information requires proper handling to prove useful. There are many different
approaches to analyzing this data such as time series analysis or regression
modeling and as time progresses even more advanced techniques are being developed.
The research on this topic was conducted by analyzing the tools used by these
companies, such as sentiment analysis and segmentation modeling and the tools
used to manage data in general such as SQL and R. The purpose of this project
is to provide a perspective on how important information management is to the
modern world and to show that the new techniques in data analysis are critically
important to success as a major business.
Update on Life
Today I'm giving a general update on things I've been doing for the past few weeks. I've been learning a lot about the programming language/analytics tool R which is enormously useful for creating models and processing data. It shares common features with many languages like C++ or Java and only requires learning a little new syntax. It's made a number of my projects easier.
This last weekend I learned all about the mathematics of sound design helping my theater set up for their annual fundraiser Broadway in the Hills. The gist of it is that setting up a temporary acoustic environment in a day is enormously challenging and requires a LOT of wiring.
In terms of colleges last week was the D-Day for a lot of schools and I am happy to announce that I was rejected by all of the other high end schools I applied to including Harvard, Caltech, MIT, and Harvey Mudd. While slightly saddening I can understand their decisions as my applications may have suffered after I was accepted into Stanford in December.
On top of my internship I am currently a part of 3 performances of the Fountain Hills Theater. I am running sound for the comedy The Man Who Came to Dinner, student stage managing The Little Princess: Sara Crewe, and performing at Papa Vito in our annual Murder Mystery event, Bellamorte! I do these with a mix of pleasure and pain as I know that there will not be many more chances for me to spend time at my home away from home for these past 4 years but I hope to go out with a bang!
Sunday, March 23, 2014
Sentiment Analysis
Every day we create over 2.5
quintillion bytes of data. This is over 20,000 times the size of the
English text version of Wikipedia. This is information from Facebook
statuses to Tweets to product reviews to millions of different
things. Now say that you are a company and you want to find out the
public opinion on something be it a product, a politician, or a
medical procedure. It is not easy to have a person read the
equivalent of 20,000 Wikipedias to decide if people like something or
not. And while much of this data is irrelevant, it is often very hard
to know where to look for your data. And even if you do, if you know
you want to look at every tweet in the last week and find the opinion
on something, well searching every tweet made in the last week would
still take more man power than most countries let alone businesses
can provide.
Here comes sentiment analysis.
Sentiment analysis is a technique that uses computers to analyze text
to judge opinion. Your initial thought might be "Well, that
shouldn't be too hard, and computers work a lot faster than people."
You would be slightly right and slightly wrong. Getting computers to
recognize a human concept just from the words used is a very hard
task. A first approach would be to look for positive or negative
words in relation to your product. But what if someone says "I
would hate for someone to live without this product" or "If
you enjoy a pleasant day of sizzling the skin off your feet or eating
food so lively that you get dysentery, this vacation spot is the
place for you!" Sarcasm is a complex linguistic process that
many humans fail to understand, let alone machines.
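Here is a minimal Python sketch of that first approach, counting positive and negative words. The word lists and the function name naive_sentiment are made up for illustration, and real sentiment analysis tools are far more sophisticated. The point is only to show how the two example sentences above fool a simple word counter.

```python
# Hypothetical word lists; a real system would use much larger lexicons.
POSITIVE_WORDS = {"love", "great", "enjoy", "pleasant", "amazing"}
NEGATIVE_WORDS = {"hate", "terrible", "awful", "dysentery", "horrific"}


def naive_sentiment(text):
    """Score text as (# positive words) - (# negative words)."""
    score = 0
    for raw_word in text.split():
        # Drop surrounding punctuation and ignore capitalization.
        word = raw_word.strip(".,!?\"'").lower()
        if word in POSITIVE_WORDS:
            score += 1
        elif word in NEGATIVE_WORDS:
            score -= 1
    return score


praise = "I would hate for someone to live without this product"
sarcasm = (
    "If you enjoy a pleasant day of sizzling the skin off your feet or eating "
    "food so lively that you get dysentery, this vacation spot is the place for you!"
)

print(naive_sentiment(praise))   # negative score, even though the sentence is praise
print(naive_sentiment(sarcasm))  # positive score, even though the review is scathing
```

Both sentences get a score with the wrong sign, which is exactly why counting words alone is not enough.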
And yet this is what these computers
do. They analyze text to judge public opinion and companies look at
the results and make decisions based on what they find. These
techniques are incredibly versatile and used in a myriad of ways in
the electronic world.
Time Series Modelling
Time series are incredibly useful tools
for modeling systems. Time series are basically representations of
variables that change over time. They can be used to model ocean
currents, stocks, population, and pretty much everything that changes
over time.
Construction of time series is done by
analyzing past data for a number of trends. These can be as
simple as whether the data is cyclic, that is, whether it repeats a pattern over some
time interval. Or they can be more complex, such as having various
frequency dependencies that cause various smaller cycles to occur
within a larger cycle.
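One common way to check whether data is cyclic is autocorrelation: compare the series with a copy of itself shifted by some lag and see which shift lines up best. The sketch below is a minimal pure-Python version under simple assumptions. The function names and the sample series are made up for illustration, and it picks only the single strongest lag rather than untangling cycles within cycles.

```python
def autocorrelation(series, lag):
    """Correlation between the series and itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a constant series has no pattern to find
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def best_period(series, max_lag):
    """Return the lag (1..max_lag) at which the series best matches itself."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Made-up data that repeats every 4 steps.
series = [1, 3, 7, 3] * 6
print(best_period(series, 10))  # → 4
```

Real data is noisier than this, so the peak is rarely this clean, but the idea of looking for the lag where the data best matches itself carries over.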
Some time series are chaotic in nature
meaning that starting out with similar but not equal initial
conditions can yield large differences in their progressions over
time. Many natural systems are chaotic such as water flow during a
storm, double pendulum machines, or turbulence in a vortex.
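A standard toy example of this sensitivity is the logistic map, a one-line rule that behaves chaotically for certain settings. The sketch below is an illustration I am adding, not one of the natural systems listed above. The function name logistic_map and the starting values are arbitrary choices.

```python
def logistic_map(x0, r=4.0, steps=60):
    """Iterate x -> r * x * (1 - x) starting from x0 and return every value."""
    values = [x0]
    for _ in range(steps):
        values.append(r * values[-1] * (1.0 - values[-1]))
    return values


# Two starting points that differ by one part in a billion.
run_a = logistic_map(0.2)
run_b = logistic_map(0.2 + 1e-9)

gaps = [abs(a - b) for a, b in zip(run_a, run_b)]
print("gap at the start:", gaps[0])
print("largest gap seen:", max(gaps))
```

The two runs start out practically identical, yet within a few dozen steps they have drifted far apart, which is why long-range prediction of a chaotic series is so hard.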
Time series can also be used to model
systems that change with respect to other variables over time. This
way it can model things like the stock market which changes due to
many variables such as inflation or earnings. Developing an accurate
time series model then allows extrapolation to future events and
allows for predictions to be made. This also shows some of the
limitations of the theoretical uses because clearly we do not have
accurate predictors of the stock market.
This occurs because there are so many
variables that affect our system that we cannot perfectly model the
system. Generally we settle for approximations of systems which gives
us a general idea but does not give perfect results. We construct
these models to allow general predictions to be made and we strive to
improve our models as this gives us results that are closer and
closer to reality.
SQL and You
If you have ever used the internet in
any way you have interacted with SQL and probably don't know it. SQL
is an amazing tool used in the backbone of the internet to access
data. SQL is the communication line between data stored in a table
and a user's screen where the data is wanted. Data such as your
personal information on your Facebook profile, your tastes as
catalogued by Netflix, frequently searched terms from Google, or item
types you've shown an interest in to Amazon. All of this information
is stored neatly in tables in a server and SQL is the key to getting
it where it needs to go.
SQL is a programming language that is a
key part of website design. SQL commands can store information
inputted by an individual, recall data from a table, and check
various conditions in user and website variables to alter commands
accordingly. SQL is highly versatile and can be built directly
into a website's backbone. What this means is that
the server-side code that generates a website's pages can contain pieces
of SQL that handle movement of data from server to client.
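As a small illustration of storing and recalling data with SQL, here is a sketch that uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, and sample rows are all made up for this example. The databases behind a site like Netflix or Facebook are of course far larger and are not necessarily SQLite.

```python
import sqlite3

# An in-memory database stands in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE interests (user_name TEXT, category TEXT, rating INTEGER)"
)

# Store information entered by users (hypothetical sample data).
rows = [
    ("ryan", "documentaries", 5),
    ("ryan", "horror", 1),
    ("alex", "comedy", 4),
]
conn.executemany("INSERT INTO interests VALUES (?, ?, ?)", rows)
conn.commit()


def liked_categories(connection, user_name, min_rating=4):
    """Recall the categories a user rated at least `min_rating`."""
    cursor = connection.execute(
        "SELECT category FROM interests "
        "WHERE user_name = ? AND rating >= ? ORDER BY category",
        (user_name, min_rating),
    )
    return [row[0] for row in cursor.fetchall()]


print(liked_categories(conn, "ryan"))  # → ['documentaries']
```

The question marks are placeholders that the database fills in safely, so user input never gets pasted straight into the SQL command.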
SQL is heavily involved with targeted
advertisements. Companies that have data on you such as Facebook or
Google will store this information in a table. When web pages are
loaded and advertisements are selected they are picked so that the
advertisements have characteristics that have appealed to you in the
past and have been documented in their databases.
Thursday, February 27, 2014
Segmentation
Segmentation is a key goal of many
service providing companies and is a large component of both my
research project and my work with Axtria. Segmentation is the process
of dividing up clientele into subgroups based on available
characteristics. It has uses that range from identifying prospective
donors to targeting ads based on an individual's interests.
Groups can be segmented in a number of
ways. Say for example you have a list of prospective donors and a
list of past donors with a number of characteristics for each. You
may know for both groups things like age, estimated income,
connections with the organization, and many other variables. And you
also have information on the donations of past donors such as amount
given, frequency of gifts and things like this. Segmentation would
work by trying to find a correlation between characteristics we care
about (donation information) and other information about the donor
base so we can know which potential donors would be more probable to
donate so that more time can be focused on them.
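The donor example can be sketched in a few lines of Python. This is a deliberately simple stand-in for real segmentation modelling, not the method used by Axtria or anyone else: it groups past donors by two characteristics, averages their gifts, and ranks prospects by the average for their group. All names, field names, and figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical past donors with two characteristics and a gift amount.
past_donors = [
    {"age_group": "under 40", "connection": "alumni", "amount": 50},
    {"age_group": "under 40", "connection": "alumni", "amount": 150},
    {"age_group": "40 and over", "connection": "alumni", "amount": 500},
    {"age_group": "40 and over", "connection": "alumni", "amount": 300},
    {"age_group": "40 and over", "connection": "none", "amount": 20},
]

# Hypothetical prospects: same characteristics, no donation history yet.
prospects = [
    {"name": "A", "age_group": "under 40", "connection": "alumni"},
    {"name": "B", "age_group": "40 and over", "connection": "alumni"},
    {"name": "C", "age_group": "40 and over", "connection": "none"},
    {"name": "D", "age_group": "under 40", "connection": "none"},
]

SEGMENT_KEYS = ("age_group", "connection")


def average_gift_by_segment(donors, keys):
    """Average past gift for each combination of the chosen characteristics."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for donor in donors:
        segment = tuple(donor[key] for key in keys)
        totals[segment] += donor["amount"]
        counts[segment] += 1
    return {segment: totals[segment] / counts[segment] for segment in totals}


def rank_prospects(people, averages, keys):
    """Sort prospects so those in the most generous segments come first."""
    def expected_gift(person):
        # Segments never seen among past donors get an expected gift of 0.
        return averages.get(tuple(person[key] for key in keys), 0.0)

    return sorted(people, key=expected_gift, reverse=True)


averages = average_gift_by_segment(past_donors, SEGMENT_KEYS)
ranked = rank_prospects(prospects, averages, SEGMENT_KEYS)
print([person["name"] for person in ranked])  # → ['B', 'A', 'C', 'D']
```

With only a handful of donors per group these averages are very rough, which is one reason real segmentation relies on much larger data sets and proper statistical models.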
Another example would be how Netflix
uses segmentation. Netflix has an immense user base which they have a
variety of information on from age, to categories of interest, to
individual media that they found incredible or horrific. Thus, when
new people sign up for Netflix and list their interests Netflix
pattern matches them toward television shows and movies that they are
most likely to enjoy. As customers watch and rate more Netflix has a
larger data set and can make better and better suggestions.
Segmentation is but one small tool used
in data analysis but it proves very useful when applied to the right
circumstances.
Monday, February 24, 2014
Excelling with Excel
The most prominent thought that has
occurred to me many times since starting my internship is the fact
that Microsoft Excel is a beautiful tool to work with. While far from
perfect, the number of amazing, clever, and plain useful features it
brings is astounding.
At its core Excel appears to be just a
sophisticated spreadsheet: a tool for putting data in nice little
rows for easy consulting with a lot of high-tech looking features
that are too complex for actual use. But after a little instruction,
the intricacies of Excel reduce labor and minimize pain.
Excel can do basic things like sum the
items in a column or multiply the values in a row but it has much
more power than this. In a table listing population of European
countries it can color code the largest and smallest, tell you the
average population, graph the distribution of population, and tell
you how many country names include the letter 'o'. With a table of
financial information it can tell you what attributes contribute the
most toward revenue, which factors are nearly irrelevant and how to
generate the most profit.
Excel is a wonderful tool to analyze
data though it does have its limits. For exceedingly large data
tables Excel begins to run very slowly. While its user-friendly
graphics tend to help with understanding they also require computer
resources based on the number of fields entered. But even with this
drawback Excel is a highly useful tool for basic analysis and it
erases much of the drudgery from data crunching.
Friday, February 14, 2014
The Wonderful World of Corporate
This week I discovered that, for someone with two parents who have spent the majority of their lives working in the corporate sphere, I am remarkably ignorant of how the corporate world functions. And I guess that after a lifetime of exposure limited to the consumer side of business life, this should not be such a surprise.
Over the past week I've been exposed to a number of facets of the modern business. From lectures on employee efficiency to conference calls with heads of IT, I have learned much about the internal structure of modern corporations.
This is a key feature to understanding how businesses use data analytics. Many different departments work together to produce useful output. R and D must interpret data and identify significance, Engineering must figure out how to manufacture this change, Marketing must find a way to publicize this technology. All these independent sub-units must cooperate and listen to each other to develop a useful product.
Information sharing is crucial for cooperation and it brings many issues to the table. Even the simple communication of data is complicated. Data may be in Excel Spreadsheets while the analysts might want to look at it in R. Getting many people from many organizations on the same page is a difficult task and involves much communication so that services can be smoothly rendered. And all of these things coalesce to make the corporate network a spiderweb of interconnecting chaos.
Tuesday, January 21, 2014
The Beginning of the End 1/21/14
As the second trimester winds down we seniors find ourselves preparing to journey far and wide to fulfill our Senior Research Projects. My journey is electronic rather than geographic as I work alongside the data analytics company Axtria and learn about the field of data analysis from the comfort of my own home as Axtria does most of their work online.
I am Ryan Smith, future student of Stanford University, passionate mathematician and novice computer scientist. Using the techniques I learn while working with Axtria will help me research how companies use data to target their customers, improve their business practices and make large sums of money. For more information on what I am researching you can view my entire proposal here: https://drive.google.com/file/d/0BzhSjAYafblcX3pzR1lxQlVyb2c/edit?usp=sharing.
Starting with the end of the second trimester on February 7th, I will start my project by beginning my internship with Axtria. I will also be investigating the various applications companies like Google, Amazon, Facebook and Netflix use to manage data and what they get out of it. This internship and research project will be my first step toward life as a college student studying mathematics and computer science at Stanford University!
You can also find more information on the company I am working with, Axtria, here: http://axtria.com/.