Monday, April 7, 2014

Order within Chaos

My first exposure to statistical analysis came from the marvelous book Chaos Theory Tamed by Garnett P. Williams. This book explores chaotic systems: systems that are organized by clearly defined rules yet have seemingly random behavior and are highly sensitive to initial conditions. In it Williams covers many of the fundamental techniques I apply when analyzing time series, such as stationarity checks and analysis of seasonality, periodicity, and trend.

The idea behind chaos theory is to look for trends in information, finding clues that indicate there is a pattern behind the data rather than just random noise. This mindset applies to many other branches of statistical analysis. One of the main goals in modelling is determining whether one variable helps predict another or whether they have no effect on one another. Finding these correlations is vital to developing proper models.

Chaotic behavior is often found in fields like fluid dynamics and is hypothesized to exist in systems as complex as the stock market. It is an incredibly interesting phenomenon that demonstrates many of the fundamental features of statistical modelling.
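To make "sensitive to initial conditions" concrete, here is a tiny sketch in R (the tool I have been using for this project) of the logistic map, a classic chaotic system. The starting values are arbitrary picks for illustration.

# Logistic map: x[n+1] = r * x[n] * (1 - x[n]); r = 4 is in the chaotic regime
logistic_map <- function(x0, r = 4, steps = 20) {
  x <- numeric(steps)
  x[1] <- x0
  for (n in 2:steps) {
    x[n] <- r * x[n - 1] * (1 - x[n - 1])
  }
  x
}

a <- logistic_map(0.2000)    # one trajectory
b <- logistic_map(0.2001)    # a nearly identical start
round(abs(a - b), 4)         # the tiny gap balloons within a couple dozen steps

Two starting points that differ by one part in two thousand end up on completely different paths, which is exactly what makes long-range prediction of chaotic systems so hard.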

Universally Selected Hyper Logically Developed Quantum Information Theory (U SHLD QuIT)

A common problem that I have mentioned many times before is the problem of big data. There is an enormous amount of information in the world. We have learned how to harness inputs from a myriad of different fields, and the result is more data than can feasibly be handled using classical techniques.

This naturally gives rise to the question: what are some non-classical techniques? Many have been discussed here before, such as different ways of isolating trends or different methods of modelling. These ideas are based on the fact that we have only so much processing power and increasingly large amounts of data. But there is another option. What if we limit our processing technique, but in exchange give it nearly unlimited power? In other words, running our programs won't tell us the same things, but they will run orders of magnitude faster! What is this amazing technology, you ask? Well, welcome to quantum computing.

Quantum computing has suffered from a great deal of poorly researched journalism over the years, but after focusing my summer research project at Stanford on quantum information theory, I feel prepared enough to banish the illusions.

The basis of quantum computing is that the concept of a bit, a "light bulb" that is either on or off, can be slightly changed. In classical computing this idea of on or off, all or nothing, is how we store data. Through long strings of on and off light bulbs (or 1's and 0's as they are often known) we can express all manner of ideas. Quantum computing uses physical properties of the universe to make things a little more interesting. Instead of a bit being on or off, it has some probability of being on and some probability of being off. Basically, that means we don't know whether it is a 1 or a 0; if we look closely enough we can find out, but without looking closely all we know are these probabilities. (And while this explanation still skates around some major concerns, it is accurate enough for this blog post.)

But Ryan, what does this have to do with analyzing data? I'm glad you asked! It turns out that since this bit can have a whole continuous spectrum of probabilities of being on or off, it can store a lot more data. This means that we can take large amounts of data, translate them into these "quantum bits," and use them for our purposes. But this comes with a great drawback: information in a "qubit" is not as accessible as in a regular bit. When we "read" a qubit, it resolves into either a 1 or a 0, and any other information is lost. However, there are certain mathematical techniques we can use to solve problems faster than we could using classical bits. And thus comes the hope that someday we can use these techniques to analyze large amounts of data quickly.
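To make the "reading loses information" point concrete, here is a toy R sketch of measuring a qubit. The amplitudes are made-up values, and this simulates only the measurement statistics, not real quantum hardware.

# A toy qubit: amplitudes for 0 and 1 (made-up example values)
alpha <- sqrt(0.7)                  # amplitude for reading a 0
beta  <- sqrt(0.3)                  # amplitude for reading a 1
alpha^2 + beta^2                    # squared amplitudes are probabilities; they sum to 1

# "Reading" collapses the qubit: each measurement yields a single 0 or 1
measure <- function(n) sample(c(0, 1), n, replace = TRUE, prob = c(alpha^2, beta^2))
table(measure(10000))               # roughly 70% zeros and 30% ones; the amplitudes themselves are gone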

SUMaC and Statistics

Mathematics and modelling go hand in hand. Much of mathematics is simply the development of models that fit some subset of the universe or categorize some phenomenon. So when it comes to statistical modelling, a good deal of math is often involved. Yet there is a problem: as useful as this modelling is, it suffers from an enormous scarcity of practitioners, and the reason is a challenging one.

Mathematics is one of the least popular fields in all of academia. Many non-academics fear, loathe, or reject math for a variety of reasons. Even many academics have professed that math is simply "not for them." And that could have happened to me too. While I was not struggling with math, and even had a certain fondness for it, I was not particularly compelled to know more about it. I lacked the curiosity that is necessary to strive for insight into mathematical problems.

Thankfully I avoided this fate by enrolling in Mrs. Bailey's Category Theory class. There I gained an appreciation of math that I had lacked before, and it led me on my path toward the Stanford University Mathematics Camp. That was where I learned the true meaning of being a mathematician. It is more than creating formulas and equations. Those things are often done, but it comes down to more than that. Mathematics is about solving problems in a logical manner. And these techniques are the cornerstone of succeeding in the modern business world.

Friday, April 4, 2014

A Brief Summary of Ryan Smith

One of my frequent human interactions in the last few weeks has been in the Stanford Facebook group. It is composed of admitted students, and after regular decisions came out a week ago the group has been inundated with new people. A common trend is for people to introduce themselves, talk about the things they like, and so on. I eventually decided to take a stab at it, and here is my introduction.

I want you all to know that you are influencing me with peer pressure and that is wrong and you should feel terrible. Well, with that out of the way...
Hello everyone, I'm Ryan Smith, and if you know a way to put me into stasis until September please let me know! I am the youngest of 5, an act-a-holic, frequent video game connoisseur, and math enthusiast. That's in fact how I came to apply to Stanford. A few of my friends had attended the Stanford Mathematics Camp (SUMaC represent!), which led me to apply in 2013, and I proceeded to have one of the best summers of my life. I met lots of amazing people, many of whom are in this group, and fell in love with the campus, the lifestyle, and the community. From that point on I couldn't see myself going anywhere else, and I was and am very relieved to have found out in December that I was going.
On other topics, I've made a second home at my local community theater and have been a part of over 40 performances in the last 4 years and if there is anything that I am going to miss it will be my lovely Fountain Hills Theater.
The other major influences in my life have been Warcraft III, my first major online game, WoW and LoL as a place where I found many of my closest friends, and my obnoxious older sister who has shaped my mind to her own purposes.
I"m going to major in Mathematics and Computer Science and love learning about all of the amazing technologies we use. Anyway, that's me, hi. How are you?

Abstract

Today I am sharing the abstract for my SRP presentation. It is the first step toward the actual presentation, which I will give in May. Without further ado, here is my abstract.

As the world of data analytics becomes increasingly vital to the business world, many corporations are utilizing it to streamline their marketing, sales, and development departments. This research project explores the data manipulation techniques and tools used by software giants like Google, Facebook, Amazon, and Netflix to market their products and improve their services. These companies utilize petabytes of information, ranging from data on their clients to marketing trends of certain products, and this information requires proper handling to prove useful. There are many different approaches to analyzing this data, such as time series analysis or regression modeling, and as time progresses even more advanced techniques are being developed. The research on this topic was conducted by analyzing the tools used by these companies, such as sentiment analysis and segmentation modeling, and the tools used to manage data in general, such as SQL and R. The purpose of this project is to provide a perspective on how important information management is to the modern world and to show that the new techniques in data analysis are critically important to success as a major business.


Update on Life

Today I'm giving a general update on things I've been doing for the past few weeks. I've been learning a lot about the programming language/analytics tool R, which is enormously useful for creating models and processing data. It shares common features with many languages like C++ or Java and only requires learning a little new syntax. It has made a number of my projects easier.
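For a taste of why R makes these projects easier, here is a trivial sketch with made-up numbers; note how the filtering and the trend fit are one-liners rather than the loops you would write in C++ or Java.

sales <- c(120, 95, 143, 88, 167)    # made-up monthly figures
mean(sales)                          # average, no loop required
sales[sales > 100]                   # filter in a single expression
month <- seq_along(sales)
coef(lm(sales ~ month))              # a quick linear trend fit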

This last weekend I learned all about the mathematics of sound design while helping my theater set up for its annual fundraiser, Broadway in the Hills. The gist of it is that setting up a temporary acoustic environment in a day is enormously challenging and requires a LOT of wiring.

In terms of colleges, last week was D-Day for a lot of schools, and I am happy to announce that I was rejected by all of the other high-end schools I applied to, including Harvard, Caltech, MIT, and Harvey Mudd. While slightly saddening, I can understand their decisions, as my applications may have suffered after I was accepted to Stanford in December.

On top of my internship I am currently a part of 3 productions at the Fountain Hills Theater. I am running sound for the comedy The Man Who Came to Dinner, student stage managing The Little Princess: Sara Crewe, and performing at Papa Vito in our annual Murder Mystery event, Bellamorte! I do these with a mix of pleasure and pain, as I know there will not be many more chances for me to spend time at my home away from home of these past 4 years, but I hope to go out with a bang!

Sunday, March 23, 2014

Sentiment Analysis

Every day we create over 2.5 quintillion bytes of data. This is over 20,000 times the size of the English text version of Wikipedia. This information ranges from Facebook statuses to tweets to product reviews to millions of other things. Now say that you are a company and you want to find out the public opinion on something, be it a product, a politician, or a medical procedure. It is not easy to have a person read the equivalent of 20,000 Wikipedias to decide whether people like something or not. And while much of this data is irrelevant, it is often very hard to know where to look. Even if you do know, say you want to examine every tweet from the last week to find the opinion on something, searching them all would still take more manpower than most countries, let alone businesses, can provide.
Enter sentiment analysis. Sentiment analysis is a technique that uses computers to analyze text to judge opinion. Your initial thought might be, "Well, that shouldn't be too hard, and computers work a lot faster than people." You would be slightly right and slightly wrong. Getting computers to recognize a human concept just from the words used is a very hard task. A first approach would be to look for positive or negative words in relation to your product. But what if someone says "I would hate for someone to live without this product" or "If you enjoy a pleasant day of sizzling the skin off your feet or eating food so lively that you get dysentery, this vacation spot is the place for you!"? Sarcasm is a complex linguistic process that many humans fail to understand, let alone machines.
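As a minimal sketch of that first word-counting approach, here is a toy scorer in R. The word lists and sentences are made-up stand-ins, and as you can see it walks right into the sarcasm trap.

# Toy sentiment scorer: positive words minus negative words
positive <- c("love", "great", "enjoy", "pleasant", "amazing")
negative <- c("hate", "terrible", "awful", "horrific", "worst")

score <- function(text) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  sum(words %in% positive) - sum(words %in% negative)
}

score("I love this product, it is great")                        #  2: looks positive
score("I would hate for someone to live without this product")   # -1: fooled by the phrasing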

And yet this is what these computers do. They analyze text to judge public opinion and companies look at the results and make decisions based on what they find. These techniques are incredibly versatile and used in a myriad of ways in the electronic world.

Time Series Modelling

Time series are incredibly useful tools for modeling systems. A time series is basically a representation of a variable that changes over time. Time series can be used to model ocean currents, stocks, populations, and pretty much anything else that changes over time.
Construction of a time series model is done by analyzing past data for a number of trends. These can be as simple as whether the data is cyclic, that is, whether it repeats a pattern over some time interval. Or they can be more complex, such as frequency dependencies that cause various smaller cycles to occur within a larger cycle.
Some time series are chaotic in nature, meaning that starting out with similar but not equal initial conditions can yield large differences in their progressions over time. Many natural systems are chaotic, such as water flow during a storm, double pendulums, or turbulence in a vortex.
Time series can also be used to model systems that change with respect to other variables over time. This way they can model things like the stock market, which changes due to many variables such as inflation or earnings. Developing an accurate time series model then allows extrapolation to future events and allows predictions to be made. This also shows some of the limitations of the theory in practice, because clearly we do not have accurate predictors of the stock market.

This occurs because there are so many variables affecting the system that we cannot model it perfectly. Generally we settle for approximations, which give us a general idea but not perfect results. We construct these models to allow general predictions to be made, and we strive to improve them, as this gives results that are closer and closer to reality.
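As a small illustration of pulling trend and cycle out of a series, here is an R sketch. The data is simulated with a known trend and yearly cycle (all numbers made up), and decompose() recovers those components.

# Simulate 4 years of monthly data: trend + seasonal cycle + noise
set.seed(42)
month    <- 1:48
trend    <- 0.5 * month                      # steady upward drift
seasonal <- 10 * sin(2 * pi * month / 12)    # yearly cycle
noise    <- rnorm(48, sd = 2)
series   <- ts(trend + seasonal + noise, frequency = 12)

parts <- decompose(series)   # splits the series into trend, seasonal, and random parts
plot(parts)                  # the hidden structure, separated from the noise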

SQL and You

If you have ever used the internet in any way, you have interacted with SQL and probably don't know it. SQL is an amazing tool used in the backbone of the internet to access data. SQL is the communication line between data stored in a table and the user's screen where the data is wanted: data such as your personal information on your Facebook profile, your tastes as catalogued by Netflix, frequently searched terms from Google, or item types you've shown an interest in on Amazon. All of this information is stored neatly in tables on a server, and SQL is the key to getting it where it needs to go.
SQL is a programming language that is a key part of website design. SQL commands can store information inputted by an individual, recall data from a table, and check various conditions in user and website variables to alter commands accordingly. SQL is highly versatile and can be incorporated directly into the backend code that serves a website's HTML. What this means is that the very code that builds the layout of a website can trigger pieces of SQL that handle the movement of data from server to client.
SQL is heavily involved in targeted advertising. Companies that have data on you, such as Facebook or Google, will store this information in a table. When web pages load and advertisements are selected, they are picked so that the advertisements have characteristics that have appealed to you in the past, as documented in those databases.
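As a minimal sketch of what such a query looks like, here is R talking to a throwaway SQLite database through the DBI and RSQLite packages (which you would need installed); the table and columns are invented for the example.

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # a temporary in-memory database
dbWriteTable(con, "users", data.frame(
  name     = c("Ana", "Ben", "Cleo"),
  interest = c("sports", "cooking", "sports")
))

# The kind of SQL that decides who sees a sports ad
dbGetQuery(con, "SELECT name FROM users WHERE interest = 'sports'")
dbDisconnect(con)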


Thursday, February 27, 2014

Segmentation

Segmentation is a key goal of many service-providing companies and is a large component of both my research project and my work with Axtria. Segmentation is the process of dividing clientele into subgroups based on available characteristics. It has uses that range from identifying prospective donors to targeting ads based on an individual's interests.

Groups can be segmented in a number of ways. Say, for example, you have a list of prospective donors and a list of past donors, with a number of characteristics for each. For both groups you may know things like age, estimated income, connections with the organization, and many other variables. You also have information on the giving of past donors, such as amounts given and frequency of gifts. Segmentation works by trying to find a correlation between the characteristics we care about (donation information) and the other information about the donor base, so we can know which potential donors are more likely to donate and focus more time on them.
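A minimal sketch of that donor idea in R might use logistic regression. Everything below is invented data; the point is just the shape of the technique.

# Made-up historical donors: characteristics plus whether they gave
set.seed(7)
donors <- data.frame(
  age    = sample(25:75, 200, replace = TRUE),
  income = rnorm(200, mean = 60, sd = 15)        # in thousands
)
donors$donated <- rbinom(200, 1, plogis(0.05 * (donors$income - 60)))  # made-up link to income

# Model the correlation between characteristics and giving
fit <- glm(donated ~ age + income, data = donors, family = binomial)

# Score a prospect: estimated probability that they will donate
predict(fit, data.frame(age = 50, income = 80), type = "response")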

Another example is how Netflix uses segmentation. Netflix has an immense user base on which it has a variety of information, from age, to categories of interest, to the individual media each user found incredible or horrific. Thus, when new people sign up for Netflix and list their interests, Netflix pattern matches them to the television shows and movies they are most likely to enjoy. As customers watch and rate more, Netflix has a larger data set and can make better and better suggestions.
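One common way to carve a user base into taste groups like that is clustering. Here is a toy k-means sketch in R on invented viewing data.

# Invented viewing profiles: hours watched per genre for 100 users
set.seed(1)
viewers <- data.frame(
  comedy = c(rnorm(50, 10, 2), rnorm(50, 2, 1)),
  horror = c(rnorm(50, 1, 1),  rnorm(50, 8, 2))
)

segments <- kmeans(viewers, centers = 2)   # split users into 2 taste groups
table(segments$cluster)                    # how many users landed in each segment
segments$centers                           # the "typical viewer" of each group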

Segmentation is but one small tool used in data analysis, but it proves very useful when applied in the right circumstances.

Monday, February 24, 2014

Excelling with Excel

The most prominent thought to occur to me since starting my internship is that Microsoft Excel is a beautiful tool to work with. While far from perfect, the number of amazing, clever, and plain useful features it offers is astounding.

At its core, Excel appears to be just a sophisticated spreadsheet: a tool for putting data in nice little rows for easy reference, with a lot of high-tech-looking features that seem too complex for actual use. But after a little instruction, the intricacies of Excel reduce labor and minimize pain.

Excel can do basic things like sum the items in a column or multiply the values in a row, but it has much more power than this. In a table listing the populations of European countries, it can color code the largest and smallest, tell you the average population, graph the distribution, and tell you how many country names include the letter 'o'. With a table of financial information it can tell you which attributes contribute the most toward revenue, which factors are nearly irrelevant, and how to generate the most profit.
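For a sense of how small those operations really are, here are the population examples sketched in R with rough, made-up figures; the comments note the equivalent built-in Excel functions.

countries  <- c("Norway", "Portugal", "Greece", "Poland")
population <- c(5.1, 10.4, 10.8, 38.0)     # millions, rough figures for illustration

mean(population)              # average population (Excel: AVERAGE)
range(population)             # smallest and largest (Excel: MIN and MAX)
sum(grepl("o", countries))    # names containing 'o' (Excel: COUNTIF with "*o*")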


Excel is a wonderful tool for analyzing data, though it does have its limits. For exceedingly large data tables, Excel begins to run very slowly. While its user-friendly graphics tend to help with understanding, they also consume computer resources in proportion to the amount of data entered. But even with this drawback, Excel is a highly useful tool for basic analysis, and it erases much of the drudgery from data crunching.

Friday, February 14, 2014

The Wonderful World of Corporate

This week I discovered that, for having two parents who have spent the majority of their lives working in the corporate sphere, I am remarkably ignorant of how the corporate world functions. And I guess that after a lifetime of exposure limited to the consumer side of business, this should not be such a surprise.

Over the past week I've been exposed to a number of facets of the modern business. From lectures on employee efficiency to conference calls with heads of IT, I have learned much about the internal structure of modern corporations.

This is key to understanding how businesses use data analytics. Many different departments work together to produce useful output. R&D must interpret data and identify significance, Engineering must figure out how to manufacture the resulting change, and Marketing must find a way to publicize the technology. All these independent sub-units must cooperate and listen to each other to develop a useful product.

Information sharing is crucial for cooperation, and it brings many issues to the table. Even the simple communication of data is complicated: data may live in Excel spreadsheets while the analysts want to work with it in R. Getting many people from many organizations on the same page is a difficult task and involves much communication so that services can be rendered smoothly. All of these things coalesce to make the corporate network a spiderweb of interconnecting chaos.

Tuesday, January 21, 2014

The Beginning of the End 1/21/14

As the second trimester winds down, we seniors find ourselves preparing to journey far and wide to fulfill our Senior Research Projects. My journey is electronic rather than geographic: I will work alongside the data analytics company Axtria and learn about the field of data analysis from the comfort of my own home, as Axtria does most of its work online.

I am Ryan Smith, future student of Stanford University, passionate mathematician, and novice computer scientist. The techniques I learn while working with Axtria will help me research how companies use data to target their customers, improve their business practices, and make large sums of money. For more information on what I am researching, you can view my entire proposal here: https://drive.google.com/file/d/0BzhSjAYafblcX3pzR1lxQlVyb2c/edit?usp=sharing.


When the second trimester ends on February 7th, I will begin my project with my internship at Axtria. I will also be investigating the various applications companies like Google, Amazon, Facebook, and Netflix use to manage data and what they get out of it. This internship and research project will be my first step toward life as a college student studying mathematics and computer science at Stanford University!

You can also find more information on the company I am working with, Axtria, here: http://axtria.com/.