Blog

MySQL MyISAM versus InnoDB usage
Wednesday, 01 October 2008

MyISAM table advantages:
1. High-speed logging (e.g. mod_log_sql for Apache): thousands of record inserts per second.
2. MyISAM MERGE tables for analysis across similar tables, ideal for statistical analysis of logs.
3. Listings, e.g. real estate websites, job websites, social networking listings, product listings, comparison shopping listings, stock listings.
4. MyISAM and compressed (read-only) MyISAM tables take much less space, which suits read-only media such as DVDs.
5. MyISAM supports full-text search (FULLTEXT indexes).
6. Indexing is much faster than with InnoDB.

InnoDB advantages:

1. Any kind of monetary or account transaction where balances are involved.
2. Referential integrity through foreign keys. For example, an orders table with a foreign key "customerID" (declared ON DELETE CASCADE) can be used to remove a customer's orders when this customer's record is deleted.
3. High read and write volume at the same time, e.g. stock quotes.
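
The referential integrity point can be sketched in a few lines. This is only an illustration: it uses SQLite through Python's standard sqlite3 module as a stand-in for InnoDB, and the table and column names (customers, orders, customerID, orderID) are assumptions made for the example. The foreign key sits on the orders table and points back to the customer.

```python
import sqlite3

# SQLite stands in for InnoDB here; in MySQL the tables would be
# created with ENGINE=InnoDB. All names below are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("CREATE TABLE customers (customerID INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """
    CREATE TABLE orders (
        orderID    INTEGER PRIMARY KEY,
        customerID INTEGER NOT NULL
                   REFERENCES customers(customerID) ON DELETE CASCADE,
        item       TEXT
    )
    """
)

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, "book"), (11, 1, "pen"), (12, 2, "lamp")],
)

# Deleting customer 1 also removes that customer's two orders.
conn.execute("DELETE FROM customers WHERE customerID = 1")
remaining_orders = conn.execute(
    "SELECT orderID, customerID FROM orders ORDER BY orderID"
).fetchall()
print(remaining_orders)
```

Only the order belonging to the remaining customer survives the delete; no application code had to clean up the orders table.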

Table conversion gotcha:
When MyISAM tables are converted to InnoDB, make sure all "SELECT COUNT(*) FROM aTable ..." queries are addressed (or removed). MyISAM tables store the exact row count, so an unfiltered COUNT(*) returns immediately, whereas InnoDB tables have to scan the rows.
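
One possible way to address such queries is to keep the count in a small side table maintained by triggers. The sketch below is an assumption about how this could be done, not a recommendation from the original post; it uses SQLite via Python's sqlite3 module for portability, and the row_counts table and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE aTable (id INTEGER PRIMARY KEY, payload TEXT);

    -- Hypothetical side table holding one cached row count per table.
    CREATE TABLE row_counts (table_name TEXT PRIMARY KEY, n INTEGER NOT NULL);
    INSERT INTO row_counts VALUES ('aTable', 0);

    CREATE TRIGGER aTable_ins AFTER INSERT ON aTable
    BEGIN
        UPDATE row_counts SET n = n + 1 WHERE table_name = 'aTable';
    END;

    CREATE TRIGGER aTable_del AFTER DELETE ON aTable
    BEGIN
        UPDATE row_counts SET n = n - 1 WHERE table_name = 'aTable';
    END;
    """
)

conn.executemany(
    "INSERT INTO aTable (payload) VALUES (?)", [("row %d" % i,) for i in range(5)]
)
conn.execute("DELETE FROM aTable WHERE id <= 2")

# Reading the cached count touches one row instead of scanning aTable.
cached = conn.execute(
    "SELECT n FROM row_counts WHERE table_name = 'aTable'"
).fetchone()[0]
scanned = conn.execute("SELECT COUNT(*) FROM aTable").fetchone()[0]
print(cached, scanned)
```

The trade-off is extra write work on every insert and delete, so this only pays off when the count is read far more often than the table changes.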

For scheduled aggregation of data from many different sources, where hundreds of millions of records need wholesale cleansing, MEMORY tables can be up to two orders of magnitude faster.
 
Comparing Analytica and DPL for financial modeling
Tuesday, 02 September 2008
 
SAS SQL Tutorials
Thursday, 14 August 2008
 
Highly Scalable Ad Exchange
Tuesday, 12 August 2008
Ad Exchange and Highly Scalable Analytics

Possibilities? What are the best of breeds?

Check out this massive grid performing Monte Carlo simulation on Amazon EC2 cloud computing: http://www.griddynamics.com.
 
SEM New Direction
Tuesday, 12 August 2008
" ... The shift in SEM is not seen in just one or two categories, but across the board. We’re witnessing a transition from directory and keyword-based search to use of rich semantics and a focus on user experience." - by Brent Terrazas · Wednesday August 06, 2008.

http://www.adtechblog.com/blog/detail/search-marketing-yesterday-today-and-tomorrow/.
 
Facebook Pixel Real Estate
Monday, 04 August 2008
This refers to the areas of the screen retained by Facebook versus the areas used by a developer.

Pixel real estate enables Facebook to do its own advertising. The area used by developers enables developers to do their own advertising.

With over 400,000 developers and 90 million users, Facebook can introduce an ad exchange for the thousands of smaller applications to provide services to this very long tail of diverse applications.

NOTE: The definition of 'developer' varies, e.g. whether a developer has one or more public apps or only apps in development. Some numbers follow:

Facebook snapshot facts from http://adonomics.com/ on August 8, 2008:

".. * There are 863,234,671 installs across 35,089 apps on Facebook with over 200,000 developers currently evaluating the platform.

* These applications were used 34,175,797 times in the last 24 hours and have a combined valuation of $391,178,677..."

The number of apps increased from 34,676 apps with combined valuation of $374,832,129 a week ago.
 
How many Facebook users uploaded their pictures?
Friday, 01 August 2008
Question: How many Facebook users uploaded their pictures?
Answer: fewer than half, as of September 18, 2008.

Methodology:
The pivot shows that in every locale there are more users without a picture than users with one (0 means no pic, 1 means has a pic).
Sample: 6508 users from the Facebook database table USER
Code:
$FQL2="SELECT first_name, last_name, pic_square, locale from user WHERE uid=$xxx";
$resultset2 = $facebook->api_client->fql_query($FQL2);
User IDs are created by a hash function in Facebook. The uid $xxx in the sample set ranges from 1050170000 to 1450180000.
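
The pivot itself can be sketched as a simple tally. The snippet below is a minimal illustration in Python, not the code used for the post: the sample rows are made up, and it assumes that an empty or missing pic_square value means the user has no picture.

```python
from collections import defaultdict


def picture_pivot(rows):
    """Count users per locale as (no_pic, has_pic), i.e. columns 0 and 1."""
    pivot = defaultdict(lambda: [0, 0])
    for row in rows:
        # Assumption: an empty or missing pic_square means "no picture".
        has_pic = 1 if row.get("pic_square") else 0
        pivot[row.get("locale") or "unknown"][has_pic] += 1
    return {locale: tuple(counts) for locale, counts in pivot.items()}


# Hypothetical rows shaped like the FQL result (uid, locale, pic_square).
rows = [
    {"uid": 1, "locale": "en_US", "pic_square": "http://example.com/a.jpg"},
    {"uid": 2, "locale": "en_US", "pic_square": ""},
    {"uid": 3, "locale": "en_US", "pic_square": None},
    {"uid": 4, "locale": "fr_FR", "pic_square": None},
    {"uid": 5, "locale": "fr_FR", "pic_square": "http://example.com/b.jpg"},
]

pivot = picture_pivot(rows)
for locale, (no_pic, has_pic) in sorted(pivot.items()):
    print(locale, no_pic, has_pic)
```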

Image
 
Business Intelligence Tools
Thursday, 17 July 2008
Looking at available analyst surveys and reports on BI (Business Intelligence) tools, it is inconclusive which BI tools are best overall. Case in point: 1) The independent http://www.bi-survey.com/ by Nigel Pendse points to Microstrategy as the leading BI tool on many criteria. 2) Cognos has won various awards over the years, e.g. from eWeek and TechTarget: http://www.cognos.com.cn/news/awards/index.html

Microstrategy's very small footprint and streamlined integration of its various packages, including finance, may have propelled it to the front of the BI race.

Cognos, however, has the Go_Search feature, enabling Google-style search of the report repository - a feature that is invaluable for end-users and power users alike.

SAS is unique in that it integrates full statistical and predictive analytics/data mining features.

If modeling is important, Analytica is a good tool to have. Analytica has influence diagrams for scenarios, probability distributions, and Monte Carlo simulation. Note that Analytica is not a traditional BI tool, i.e. it is not a reporting tool.
 
Fraud detection with Complex Event Processing
Tuesday, 27 May 2008
Opportunities and threats are two sides of the same bottom line. Threats include fraud of all kinds, both isolated and "wholesale". Fraud detection is mostly about finding significant exceptions from normal patterns. The data in a typical Business Intelligence data warehouse needs to be transformed again for fraud detection algorithms. In typical BI applications, data are gathered to predict buying, conversion, or churn patterns, e.g. which campaigns produce the most conversions for certain market segments. Here data are grouped into categories, or subgroups. A model can then be used to predict the behavior of each subgroup.

There are three related fraud metrics: detection, remedies, and prevention. Fraud detection mainly relies on statistical anomalies. Fraud prevention employs "what-if" scenarios and anticipation of possible breaches.
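
As a minimal sketch of what "statistical anomalies" can mean in practice, the snippet below flags values that sit far from the median of a series. The modified z-score (median and median absolute deviation, with the conventional 0.6745 constant and 3.5 cutoff) is one common choice among many and is an assumption here, as are the sample amounts.

```python
from statistics import median


def flag_anomalies(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value cannot mask itself
    by inflating the spread estimate.
    """
    values = list(values)
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]


# Hypothetical transaction amounts with one obvious exception.
amounts = [102, 98, 105, 97, 101, 99, 103, 100, 96, 104, 250]
print(flag_anomalies(amounts))
```

A production system would score many features at once (amount, time of day, location, velocity), but the principle of measuring distance from a robust notion of "normal" stays the same.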

Fraud detection and remedies share similarities with information security breach detection and remedies. For example, "honey-pots" are used to lure potential violators. Fraud reduction metrics can guide the process of selecting remedies and fraud detection methods. "No stone is left unturned" is a useful notion in the process of forming possible scenarios and eliminating non-useful hypotheses. On the Internet, not only is the originating IP address useful, but redirect patterns and response times can also indicate suspicious activities.

Some possible technologies:
- Rule-based engines, i.e. Blade, Ilog
- Complex event processing, i.e. SqlStream, Coral8 (see a previous blog here)
- Neural nets.
- Tibco's SOA architecture using Joint Directors of Laboratories (JDL) data fusion model

The JDL model can use statistical and data mining techniques, including classification (trees), association, correlation, and clustering, to produce normalized event streams. From these, scorecards are produced to show possible fraudulent activities. A rules-based system can be used to classify kinds of fraud.
 
Clickstream Data Source
Tuesday, 18 March 2008

CLICKSTREAM DATA COLLECTION METHODS

There are five clickstream data sources for web analytics. The pros and cons of each are listed below.

Web logs
PROS
1. Readily available from web servers, e.g. Apache common log format or Windows 2003 log format.
2. Software ranges from free and low-cost web log report tools to sophisticated web log software, e.g. SawMill.
3. ONLY method that records search engine bots.
CONS
1. Programmers are needed to create reports for marketing and user behavior analysis; numerous filters are needed to sort out the relevant data.
2. Cookies are needed for visitor identification.
3. Traffic served from page caches is not in the web log. For example, users pressing the browser's "Back" or "Forward" buttons are not recorded in the web log.

Web beacons (or page bugs)
PROS
1. Simple to implement around an HTML IMG tag.
2. Choose what to record.
3. Less filtering, as bots do not request beacons.
4. BEST for multiple-domain tracking. A third party such as Coremetrics can save a cookie on a user's browser. This cookie can be recognised across domains.
CONS
1. Third-party cookies are needed. Many anti-spyware programs block image downloads and third-party cookies. If used, users must be informed, as tracking users across unrelated domains can violate consumer privacy.

JavaScript tags (page tagging)
PROS
1. Easiest to implement: insert a few lines of JavaScript code on HTML web pages. Some products, such as Clickstream.com's, automatically embed the code into the header of every dispatched page, regardless of whether it is static or dynamically generated.
2. MOST CONTROL over what information is captured for Flash or AJAX.
CONS
1. Captures only client-side information.
2. Not easy to capture information on downloaded files (PDF, MP3, AVI, etc.).

Packet sniffing
PROS
1. Captures the most data of all methods.
CONS
1. Page caching is not captured.
2. Interactive Flash or Ajax application traffic is not captured.
3. MOST EXPENSIVE.

Event logging
PROS
1. Captures data at the application layer, not the web server layer.
2. Captures Rich Internet Applications such as Flash and Ajax.
CONS
1. Only certain software applications offer this. Because this is real-time capture, data is available before the 'return' key is hit. Applications using Coral8 or SQLstream can go further by logging every keystroke as users type, which can catch users changing their minds midstream.



NOTE: If JavaScript is used, placing the code near the top of the page captures the visit even if visitors leave right away.
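
For the web log source, most of the programming effort goes into parsing and filtering log lines. The following is a minimal sketch of parsing one line in the Apache Common Log Format with Python's re module; the function name and the sample line are illustrative, and real logs usually need extra handling (combined format fields, malformed requests).

```python
import re

# Common Log Format: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # The size field is "-" when no body was returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
record = parse_clf_line(sample)
print(record["host"], record["request"], record["status"], record["size"])
```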

Further Readings:

Choice and the Internet: From Clickstream to Research Stream
http://demo.ebusiness.uoc.gr/content/downloads/Bucklinetal_ChoiceandtheInternet.pdf

Clickstream Data Warehousing
http://www.clickstreamdatawarehousing.com/tblofcontents.html

Measuring Rich Internet Applications--Interview With Avinash Kaushik
http://www.webanalyticsassociation.org/en/art/?133
 
Keyword portfolio optimization tutorial
Monday, 01 October 2007
Introduction: This tutorial shows how to algorithmically manage AdWords (or similar ad exchanges) using Analytica, What's Best linear optimization (an Excel extension), and MineSet (cluster analysis). The possibility of using complex event processing with SQLstream or Coral8 will be discussed, as well as the use of Google's AdWords API (Apility for PHP) for automating the data gathering process.

Question: What is the minimum to spend per keyword to achieve revenue goals in real time (as soon as data is available, or on a schedule)? Also: what combination of ad purchases (or media/channel purchases) reaches target audience percentages in demographic groups, for real estate, social networking, products, or services?

STEPS:

   1. Find the cost per 1000 clicks (in Google Analytics).
   2. Create a revenue model (Analytica or Excel). Revenue depends on a number of variables, including search engine position, bounce rate, conversion rate, affiliate traffic, channel traffic, total market volume, and others. Map a large set of hundreds of thousands of keywords to a set of hundreds or thousands of keyword groups (clusters). Analytica is useful for sparse matrix mapping, where local changes within a group of keywords can be mapped to the same Adgroup.
   3. Model the question as a linear optimization. This example shows 4 keywords. It can be extended to thousands of keywords (or Adgroups) in Excel (see image 1 below). Calculate the coefficients that each keyword group contributes to each goal. Note that this is an approximation. The more closely related the keyword variations, the better the approximation, e.g. the keyword phrase "medium price home in portland oregon" can have hundreds of mutations (variations, misspellings, and closely related phrases).
   4. Use Google's AdWords API (e.g. Apility) to automate the data gathering.

Image 2 shows the solution. The blend percentage shows what share of clicks to buy through each keyword. For example, 49.9% of clicks should come through the keyword "home". This yields the most revenue for the investment. Note that the 4th keyword makes a large contribution but is not a bargain at $89.00, so it is not purchased.

What are dual values? In Image 2 below, spending drops by $5.00 (cell J9) per unit (Cost/1000 clicks) if the revenue requirement for goal 3 is reduced by one unit, from 5.2 to 4.2. Similarly, if the revenue requirement for goal 4 is reduced from 19 to 18, spending drops by $0.02 per unit. For goals 1 and 2, reducing the requirements gives no savings, since these keywords already contribute more to goals 1 and 2 than the revenue requirements demand.
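
The idea of a dual value can be reproduced on a toy problem. The sketch below is not the spreadsheet model from the images: it uses two hypothetical keywords and two hypothetical goals, solves the minimum-cost blend by checking every corner of the feasible region (valid only for two variables with positive costs), and then estimates a dual value by relaxing one goal by one unit and re-solving.

```python
from itertools import combinations


def solve_min_cost(costs, goals, eps=1e-9):
    """Minimize costs[0]*x1 + costs[1]*x2 with x1, x2 >= 0 and
    a1*x1 + a2*x2 >= r for every (a1, a2, r) in goals.

    Returns (cost, x1, x2), or None if no corner point is feasible.
    Assumes positive costs, so the optimum lies at a corner.
    """
    # Boundary lines of every constraint, plus the two axes.
    lines = list(goals) + [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    best = None
    for (a1, a2, r), (b1, b2, s) in combinations(lines, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel lines have no single intersection
        x1 = (r * b2 - a2 * s) / det
        x2 = (a1 * s - r * b1) / det
        feasible = x1 >= -eps and x2 >= -eps and all(
            g1 * x1 + g2 * x2 >= req - eps for g1, g2, req in goals
        )
        if feasible:
            cost = costs[0] * x1 + costs[1] * x2
            if best is None or cost < best[0]:
                best = (cost, x1, x2)
    return best


# Hypothetical data: cost per 1000 clicks, and each keyword's
# contribution per unit to two revenue goals with their requirements.
costs = (20.0, 30.0)
goals = [(2.0, 1.0, 8.0), (1.0, 3.0, 9.0)]

base_cost, x1, x2 = solve_min_cost(costs, goals)
relaxed_cost, _, _ = solve_min_cost(costs, [(2.0, 1.0, 7.0), (1.0, 3.0, 9.0)])
dual_goal_1 = base_cost - relaxed_cost  # saving per unit of goal 1 relaxed

print(base_cost, x1, x2, dual_goal_1)
```

Excel solvers report the same quantity directly in their sensitivity report; re-solving with a relaxed requirement is just a way to see where the number comes from.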

Goals are Conversion Funnel goals. For example, in Google Analytics, four URL goals can be set, i.e. one goal could be the "Thank you" page.
Image

Cell F18 shows a dual value of $59.89. This is the amount by which the keyword's 'Cost/1000clicks' would have to be reduced before it should be purchased in the keyword mix.

IMAGE 1 - SETUP

Image

IMAGE 2 - SOLUTION

Image

IMAGE 3
Image

 
CPC & Estimated traffic/day for most common medical diagnosis/keyword
Thursday, 27 September 2007
Source: Google Analytics
Medical keywords are based on ICD-9

The long tail analyses below use Google Analytics CPC data for about 1500 keywords for AdWords campaigns. These graphs (linear and logarithmic scales) show great opportunities for hospitals/medical centers looking to increase web traffic through these most common medical diagnoses/keywords. With a given budget, linear programming can maximize the number of clicks at the current CPC. Doing this in-house can save thousands of dollars per month for portfolio-style PPC campaigns. For optimal bidding given competitors' current bids, stay tuned for upcoming blogs.

Image

Image

Keywords   Search Volume   Estimated Avg. CPC (lower, upper)   Estimated Ad Positions (upper, lower)   Estimated Clicks / Day (lower, upper)   Estimated Cost / Day (lower, upper)
Skin       4   $0.57   $0.85   1   3   3,034   4,347   $1,730.00   $3,720.00
Dental     4   $0.57   $0.85   1   3   1,661   2,480   $950.00     $2,120.00
Shock      3   $0.42   $0.63   1   3   715     915     $310.00     $580.00
Colon      3   $0.49   $0.74   1   3   559     788     $280.00     $590.00
Bladder    3   $0.37   $0.62   1   3   554     696     $210.00     $430.00
Twins      3   $0.34   $0.56   1   3   526     665     $180.00     $380.00
Allergy    3   $0.56   $0.83   1   3   495     659     $280.00     $550.00
Lung       3   $0.48   $0.72   1   3   514     648     $250.00     $470.00
HPV        3   $0.37   $0.62   1   3   510     639     $190.00     $400.00
Prostate   3   $0.59   $0.88   1   3   393     534     $240.00     $480.00
Warts      3   $0.54   $0.80   1   3   352     443     $190.00     $360.00
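
The budgeted click optimization mentioned above can be sketched for the simplest case. With a single budget constraint and a cap on the clicks available per keyword, the linear program reduces to buying the cheapest clicks first. The snippet assumes each click costs the estimated lower CPC and that the upper clicks/day figure is the cap; both are simplifications, the daily budget is hypothetical, and only five rows of the table are used.

```python
def allocate_budget(keywords, budget):
    """Maximize total clicks for a fixed budget.

    keywords: list of (name, cpc, max_clicks_per_day).
    Buys clicks in order of increasing CPC, which is optimal when the
    only constraints are the budget and the per-keyword click caps.
    """
    plan = {}
    remaining = budget
    for name, cpc, max_clicks in sorted(keywords, key=lambda k: k[1]):
        clicks = min(max_clicks, remaining / cpc)
        plan[name] = clicks
        remaining -= clicks * cpc
    return plan


# (keyword, estimated lower CPC, estimated upper clicks per day)
keywords = [
    ("Skin", 0.57, 4347),
    ("Shock", 0.42, 915),
    ("Bladder", 0.37, 696),
    ("Twins", 0.34, 665),
    ("HPV", 0.37, 639),
]

plan = allocate_budget(keywords, budget=2000.0)  # hypothetical daily budget
for name, clicks in plan.items():
    print(name, round(clicks))
print("total clicks:", round(sum(plan.values())))
```

Once revenue goals per keyword or position-dependent CPCs enter the model, the greedy shortcut no longer applies and a general LP solver is needed.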


The remaining keywords are listed in the full post.
 
CPC (Cost per click) for common medical diagnosis (about 1500)
Thursday, 27 September 2007
CPC for common medical diagnoses (using ICD-9)
This covers about 99.99 percent of common diagnoses
Using Google Analytics - Sept 27, 2007

KEYWORD Relative search volume Ave. lower CPC Ave. upper CPC
Insomnia sleep disorder 1 $0.69 $1.00
Hormone replacement therapy 2 $0.64 $0.96
Sweating excessive 2 $0.63 $0.94
Incontinence urge 1 $0.62 $0.93
Gas-bloating 2 $0.61 $0.92
Premature ejaculation 3 $0.61 $0.92
Alcohol abuse 3 $0.60 $0.91
Chronic fatigue syndrome 2 $0.61 $0.91
Dry eye syndrome 1 $0.60 $0.91
Hyperlipidemia mixed 1 $0.60 $0.91
Child abuse 3 $0.60 $0.90
Palliative care 2 $0.60 $0.90
Plantar fasciitis 3 $0.60 $0.90
Dyslexia 3 $0.60 $0.89
Gynecomastia 2 $0.59 $0.89
Incontinence stress female 0 $0.59 $0.89
Molluscum contagiosum 2 $0.59 $0.88
Obesity morbid 2 $0.59 $0.88
Prostate 3 $0.59 $0.88
Bipolar disorder 3 $0.58 $0.87
Disability exam 1 $0.58 $0.87
Tendinitis achilles 1 $0.62 $0.87
Bruxism 2 $0.58 $0.86
Bunion 2 $0.57 $0.86
Depressive disorder 2 $0.57 $0.86
Gout 3 $0.57 $0.86
Irritable bowel syndrome 3 $0.57 $0.86
Multiple sclerosis 3 $0.58 $0.86
Ulcerative colitis 3 $0.57 $0.86
Attention deficit without hyperactivity 0 $0.56 $0.85
Dental 4 $0.57 $0.85
Drug abuse 3 $0.56 $0.85
Insomnia transient 0 $0.56 $0.85
Rosacea 3 $0.57 $0.85
Skin 4 $0.57 $0.85
Atrial fibrillation 2 $0.56 $0.84
Infertility male 2 $0.56 $0.84
Osteoporosis senile 0 $0.56 $0.84
Allergy 3 $0.56 $0.83
Asthma 3 $0.55 $0.83



 
Social Networks, Social Graphs, Web 2.0/Web 3.0
Monday, 17 September 2007
A social graph such as Facebook's maps the real world onto the Internet. This mapping includes the rich and complex relationships of the real world, including hierarchies and contexts. This takes the web closer to Web 3.0.

What is Web 3.0? One definition: Web 3.0 is the Web with embedded contextual information. At the start, it should have a model of words/concepts and their basic relationships. For example, the words 'house' and 'door' have one basic relationship, 'has': a house has doors. If a project were started to build Web 3.0 according to this definition, it might take a few years to encode all words and their relationships, depending on how many encoders work in parallel.

What is Social Networks 3.0? Using the Web 3.0 definition above, social networks 3.0 will also have search by contexts. But being more specific and more contained, social networks 3.0 is ready to be implemented now.

From a usefulness point of view, next-generation social networks need to provide more benefits as the novelty of first-generation social networks begins to wear off. MySpace is so filled with email spam and advertisements that every measure has to be taken to protect privacy. In contrast, Facebook's rise is in part due to its validating members by their affiliated school or work email address. Facebook's open API enables the third-party Jaxtr.com application to give members the option to contact each other by phone without giving out their phone numbers. This is an example of Web 2.0. But is Facebook an example of a big leap forward to Social Networking 3.0? Not yet, not until it can provide contextual search, search by association, and agents that look for what members want in classifieds and in new and changed information from various parts of the site, including members' new info. In other words, it needs to be a system capable of reasoning about its members and their relationships according to set rules or self-modifying/self-adapting rules. Google and Yahoo have begun to merge search engines with social media (http://www.searchenginejournal.com/social-medias-direct-influence-on-search-engine-ranking/5576/ )

Tribe.net has a built-in feature where members can see how they are related to other members. LinkedIn has a similar network relationship feature. However, a simple connection to another member doesn't provide deeper information about the connection, e.g. how they met or what they have in common. Facebook allows members to specify how they met, but it doesn't yet allow members to show this information, nor to create personal agents. These are glimpses of the future, although they are limited.

Dating networks are a special class of social networks. They contain much more personal information. Adultfriendfinder.com is among the top 50 on Alexa. But do they provide a better opportunity for meeting a date? This is debatable. Meeting people in live venues such as dance clubs, sporting events, or social functions is more likely to bring about real dates. Most online profiles are created to fulfill quick fantasies or mind games. Next-generation dating networks have to reflect reality more closely. For example, people's physical appearance and personality have to be more apparent. Stickam.com, a community of video bloggers, takes this to the next step. Even so, different video equipment can render people differently. Skinny people might appear heavier and vice versa. Meeting through technology is still inadequate in comparison to the information available through live interaction.

Are online platforms more conducive to interpersonal interactions? Online learning systems tried to push this direction and have discovered that they cannot replace the live classrooms in a majority of learning subjects. For most subjects, online systems are best used as supplementary systems to connect students with each other and with the teacher. In a similar online/offline line of thought, some online social/dating networks have created real events to supplement online interactions. For example, Downelink.com has monthly dance events for its members in some cities.

Most online social networks and dating networks have not incorporated proven real-life interactions. For example, few dating networks have a staff of relationship counselors or offer self-assessment tests. Here the next step is to incorporate a personality testing service to assess communication style, e.g. the Keirsey temperament test, or the Implicit Association Test for true preferences. Further, only a small percentage of people are aware of whether they adopted, as part of their upbringing, the logical, romantic, or best-friend model of relationship.

From the data modeling point of view, an ontology can be used to model people and their relationships. For example, Tribe.net or Friendfinder.com can model members' networking, e.g. when they become friends, using semantic networks (http://www.oracle.com/technology/tech/semantic_technologies/ ). These database structures are designed to store relationships and can thus be more efficient than relational databases for such queries (by one to two orders of magnitude). Relational SQL queries joining tables with more than a million records can be computationally intensive, and are inefficient for relationships that are known in advance.

The next generation of social networking and dating sites will bring a host of IT performance and search challenges, which will necessarily bring about a convergence of grid computing, ontology, database redesign, and psychological assessment technologies. Unlike IT and computing technology, which have clear maturity models, social interaction with its psychological component does not. Speaking very broadly, social networking sites, and to some extent dating sites, do reflect the larger society no matter how off-beat or playful they may be.


 
Data Mining Visualization and Employee Communication Preferences
Tuesday, 28 August 2007
The Keirsey™ Temperament Sorter®-II uses four scales to determine both Temperament and Character classifications. The four scales measure an employee's preference for Extraversion versus Introversion, Sensing versus Intuiting, Thinking versus Feeling, and Judging versus Perceiving. The four distinct temperament patterns are the following:
    1. Guardian: regards duties and accountabilities, following rules and not stepping over the lines. The four types of guardians are inspectors, protectors, providers, and supervisors.
    2. Idealist: holds a vision of what might be possible, reaching goals without compromising beliefs. The four types of idealists are healers, counselors, champions, and teachers.
    3. Artisan: deals with the here and now; being practical with his/her faculties is most important. The four types of artisans are composers, crafters, performers, and promoters.
    4. Rational: plans what can be done, rewriting the rules to solve the problems. The four types of rationals are builder/architects, fieldmarshals, inventors, and masterminds.

Each of these temperaments exhibits a pattern of communication along the following four preference scales:
  1.     Expressive versus attentive
  2.     Observant versus introspective
  3.     Tough-minded versus friendly
  4.     Scheduling versus probing

For example, the idealist-teacher type described below is predominantly eNFj (expressive, introspective, friendly, and scheduling).

For detailed descriptions of these preferences, see this page:
http://www.advisorteam.org/administering_the_KTS-II/background.html

The video below shows a 3D visualization of a group of employees and their communication preferences (using MineSet Visualization). In the 2nd half, where the blue and red dots represent male and female, note that sex is not a differentiating factor. There is a range of values of Being Observant and Probing where there are no employees. Also note that there are four major clusters. This signals a line-up of a very strong set of values preceding a certain corporate event.
 
SQL injection & Paros Proxy
Tuesday, 28 August 2007
Only a few weeks ago, on August 12, 2007, the United Nations web site was defaced. A few weeks before that, on June 29, 2007, the Microsoft UK web site was defaced as well.

This blog outlines the steps to assess whether a specific website has SQL injection vulnerabilities, using the free Paros Proxy, a web security assessment tool:

1. Download a copy from http://parosproxy.org.
2. Configure Firefox browser as follows:
- Select Tools, Options, Advanced, Network, Settings
- Check the box 'Manual proxy configuration'
- Type in 'localhost' and '8080' for HTTP proxy and Port. Click OK
3. Start Paros and Firefox
- Browse the site to be scanned
- On the left pane of Paros, select the site added from browsing
- Select Analyse, Scan Policy, Injection to see if sql_injection is checked
- Select Scan
- Select Report, Last Scan Report to see the assessments - and actions to fix the vulnerabilities.
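
The usual fix a scan report points to is replacing string-built SQL with parameterized queries. The sketch below illustrates the difference using Python's sqlite3 module; the users table, the function names, and the payload are made up for the example and are not taken from Paros output.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Vulnerable: user input is concatenated straight into the SQL text.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Parameterized: the driver passes the input as data, never as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "x' OR '1'='1"  # classic injection string
unsafe_rows = find_user_unsafe(conn, payload)
safe_rows = find_user_safe(conn, payload)
print(len(unsafe_rows), len(safe_rows))
```

The unsafe version returns every row because the payload rewrites the WHERE clause; the parameterized version treats the same payload as an ordinary (non-matching) name.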
 
Blog on Sqlstream.com
Tuesday, 14 August 2007
Tom Tuduc - Data Mining, Web Strategy, Risk & Decision Analysis, Influence Diagram, Web 2.0, Health Care

SQLstream.com is about to come online (under construction). Its patented technology originates from research work at Stanford and Berkeley (Probabilistic Complex Event Processing). Coral8's technology is based on the same research, as is Streambase's to a certain extent. It seems that SQLstream's focus is on real-time integration, whereas Coral8's focus is on dashboards, e.g. high transaction loads and hundreds of collection points. At this point, SQLstream's interests include the financial and telecom sectors.

Technical Highlights:
- SQLstream messaging is modeled after Java's JMS and uses Java for plugin development.
- The IBM Eclipse development environment is used.
- Push focus.
- Very small footprint streams

The technical difference between SQLstream and Coral8 is Java vs. C++. So the main competitor for SQLstream is Streambase, which is also a Java-based CEP solution.

To end-users, e.g. in the financial sector, Java vs. C++ is only a minor technical issue; it matters mostly to developers. The whole point of SOA and XML is to enable developers to develop in any language (nice theory, but not completely true). However, CEP using a SQL stream can integrate any kind of application, in any language, as long as it talks to the DBMS, e.g. Oracle. Thus their pitch is focused on integration, which is really catchy: integration is already one of the biggest IT markets, over $50 billion a year, and pervasive in every single industry. Dashboards and KPIs are more valuable if they can show an integrated view of the enterprise. It depends on who is buying: a committee with end-users, or the CTO and developers.

What about a 2nd order predicate calculus extension (Prolog is a well-known logic programming language with some higher-order features)? SQLstream can derive ancestral streams; however, true 2nd order logic is not there. Webarches predicts that it would bring between one and two orders of magnitude of improvement in functionality and event definitions.

2nd order simply means that instead of binding a variable x to a value, e.g. an integer, as in first order calculus, x can be bound to an expression, e.g. (x greater than 4) or (y smaller than -1). Since relational databases are based on first order predicate calculus, it seems that a natural evolution of SQL is 2nd order. This is partially done by Prolog. Prolog, however, is mostly used in research rather than production at this point; you set a goal, and the system finds facts to realize the goal.

There is no 2nd order predicate calculus in SQLstream, although there are some specific bindings. For example, if A is the parent of B, and B is the parent of C, SQLstream can deduce that A is the ancestor of C.
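
The parent-to-ancestor deduction is a transitive closure over the parent relation. The sketch below shows the idea in plain Python; it says nothing about how SQLstream implements it, and the function name is hypothetical.

```python
def ancestors(parent_pairs):
    """Return all (ancestor, descendant) pairs implied by (parent, child) pairs."""
    closure = set(parent_pairs)
    changed = True
    while changed:
        changed = False
        # If a is an ancestor of b and b is an ancestor of d,
        # then a is an ancestor of d.
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


facts = [("A", "B"), ("B", "C")]
derived = ancestors(facts)
print(sorted(derived))
```

In SQL the same deduction is typically written as a recursive query, which is one reason ancestor-style relationships are awkward in engines that lack recursion.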

SQLstream is at the frontier of application and data integration, messaging, middleware, and real-time response. It remains to be seen how SQLstream plays with the proliferation of XML, XSLT, XQuery, and other XML technologies, as these are not SQL based. For example, storing an XML document as a value in a field in a table is neither ideal nor scalable.

However, data and/or application integration is just the beginning. Perhaps Sqlstream is focusing on realtime integration of financial transactions, i.e. buying and selling of options in a window of time, i.e. if a price goes to a certain number, then exercise the option.

Integration is a step in the door in the financial sector these days. Here, the key requirements are:
- How fast and how much data can the CEP engine process, with what response time and level of complexity, e.g. 250,000 messages/sec or sub-second response time, backed by benchmarks.
- How easy and flexible is it to model complex queries, algorithms, correlating both live and stored data, from many sources.



 
Blog. Search versus Econometrics Modeling
Friday, 10 August 2007
Web Strategy & Web Analytics

Search is redefining the use of econometric modeling, surveys, polls, and statistical modeling in web strategy. Why is the number of searches on the Internet two orders of magnitude lower than the corresponding statistical and survey estimates? The Influence Diagram below links the hard data from search engines with the numbers produced by surveys and studies, e.g. the Stanford Health Insurance Study, the PIP Health Survey, and the Harris Poll Interactive. A sound web strategy must include different channels of data to discover and understand apparent orders-of-magnitude differences in the data. For example, comScore's panels are consenting web users, while Hitwise's data includes all web users of certain Internet service providers, e.g. Comcast.

Image

QUESTIONS:
1. What are the changes (and why) in demand (from models) and search (hard data) for health insurance in each state?
2. What health insurance products should be marketed to what population in what state this month?
3. What marketing campaigns should be carried out (for each segment, each state, this month)?
4. How to categorize web visitors in real time and act accordingly (which pages to serve? which banners?). Once visitors provide certain information, what behaviors can be predicted (by segment, by state, by time of day, by day of the week, etc.)? 4b. How to read meaning into visitor categories (from clustering) derived from the web log for the first time?
5. What paths through the website are users taking?
6. What percentage of visitors buy the health insurance products (reach 'thank you' page) (by time of day, day of week, cities and states, age, sex)?

REFERENCES
Reaching The Online, Uninsured Healthcare Consumer (February 14, 2007)
Which Health-Related Web Sites Are The Uninsured Visiting, And What Do These Consumers Research?
by Julie Hanson, Bradford J. Holmes
    The millions of uninsured consumers in the United States constitute a market that no plan can ignore. In fact uninsured consumers are online, and they are actively visiting health-related sites to research numerous health-related topics. Forrester found that uninsured, online consumers are becoming increasingly comfortable with the Internet, spending more time online and showing similar, if not greater, levels of technology optimism than the average consumer. To best reach this consumer group, marketers in the healthcare industry should favor the channels and sites that the uninsured favor today to promote their products and services.
http://www.forrester.com/Research/Document/Excerpt/0,7211,41474,00.html

The Harris Poll® #59, August 1, 2006
Number of "Cyberchondriacs" - Adults Who Have Ever Gone Online for Health Information - Increases to an Estimated 136 Million Nationwide
"Searching the Internet for health care information has become more widespread in the past year after three years of little growth. Use of the Internet to search for health-related information by online U.S. adults has increased markedly both in terms of percentages (from 72% in 2005 to 80% now) and in numbers. This brings the number of all U.S. adults who have ever searched for health information online (Harris Interactive® refers to them as "cyberchondriacs") to 136 million, a 16 percent increase from 117 million in 2005."
http://www.harrisinteractive.com/harris_poll/index.asp?PID=686

Online Health Search 2006
"Eighty percent of American internet users, or some 113 million adults, have searched for information on at least one of seventeen health topics. Most internet users start at a general search engine when researching health and medical advice online. Just 15% of health seekers say they 'always' check the source and date of the health information they find online, while another 10% say they do so 'most of the time.' Fully three-quarters of health seekers say they check the source and date 'only sometimes,' 'hardly ever,' or 'never,' which translates to about 85 million Americans gathering health advice online without consistently examining the quality indicators of the information they find. Most health seekers are pleased about what they find online, but some are frustrated or confused."
http://www.pewinternet.org/pdfs/PIP_Online_Health_2006.pdf

Consumer-Directed Health Insurance Products: Local-Market Perspectives
http://content.healthaffairs.org/cgi/content/abstract/25/3/766

Internet Audience up 10 Percent Worldwide
By Enid Burns, March 6, 2007
The Internet reaches 747 million people worldwide, according to numbers released by comScore Networks' World Metrix service. http://www.clickz.com/showPage.html?page=3625168

Accounting for the Cost of Health Care in the United States - McKinsey Global Institute
http://www.mckinsey.com/mgi/rp/healthcare/accounting_cost_healthcare.asp

Percentage of People Without Health Insurance Coverage by State Using 2- and 3-Year Averages: 2003 to 2005.
http://www.census.gov/hhes/www/hlthins/hlthin05/hi05t10.pdf

Income, Poverty, and Health Insurance Coverage in the United States: 2005
http://www.census.gov/prod/2005pubs/p60-229.pdf

Computer and Internet Use in the United States: 2003 - issued October 2005
http://www.census.gov/prod/2005pubs/p23-208.pdf
 
 
Analytica: spreadsheet, database, and statistics in one package.
Thursday, 24 May 2007
Analytica is three applications in one. For many data mapping, transformation, and stochastic simulation tasks, Analytica can cut development time by more than an order of magnitude by leveraging the parallelism in Analytica's Intelligent Arrays. In this simple illustration, the top 20 DRGs by national costs for males are mapped readily to the cardiology medical specialty. Note that four of these DRGs are in the cardiology specialty (the left table in the screenshot below).

Image

Data source: AHRQ http://hcupnet.ahrq.gov/HCUPnet.jsp?Id=F6CCF19B2D5688C3&Form=SelALLLISTED&JS=Y&Action=%3E%3ENext%3E%3E&_ALLLISTED=No
 
Decision Analysis & Web Analytics
Wednesday, 23 May 2007
Keywords: Web Analytics Maturity Model, Web Analytics Management, Decision Analysis.

Should you choose a portfolio management style for search engine marketing (SEM), e.g. Efficient Frontier's, or do-it-yourself SEM? This depends on the number of keywords you manage. Can you predict the expected value of your web-analytics effort? What is the value of being able to simulate the what-if variables?

The figures below show some of the variables. Some are probabilistic; others are options you can choose. User Behavior, for example, represents the possible paths a customer takes through your website.

Image

Image
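
The expected-value question can be made concrete with a small calculation. This is only a sketch of the general decision-analysis idea, not the model in the figures: the two options, their probabilities, and their payoffs below are hypothetical numbers chosen for illustration.

```python
def expected_value(outcomes):
    """Probability-weighted average payoff.

    `outcomes` is a list of (probability, payoff) pairs whose
    probabilities must sum to 1.
    """
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)


# Hypothetical options: (probability, net payoff in dollars) per scenario.
options = {
    "portfolio_managed_sem": [(0.25, 40000.0), (0.5, 10000.0), (0.25, -20000.0)],
    "do_it_yourself_sem": [(0.5, 12000.0), (0.5, 0.0)],
}

values = {name: expected_value(outcomes) for name, outcomes in options.items()}
best_option = max(values, key=values.get)
```

Changing a probability or payoff and recomputing is the simplest form of the what-if simulation mentioned above; with real inputs the ranking of the options could easily reverse.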
 
Sawmill Weblog Analysis and Web Analytics
Monday, 21 May 2007
Comparing Sawmill to web analytics tools is not a simple matter of "checking boxes on a form". Sawmill is a weblog analysis tool, whereas web analytics data can come from weblogs, web beacons or page bugs (e.g. a transparent 1x1 pixel from a third party), JavaScript tags (widely used, e.g. by Google Analytics), packet sniffing, and event logging (see the post "Decision Analysis & Web Analytics" on this page). However, Sawmill (Enterprise version) processes the log and stores the data in MySQL, which makes slicing and dicing with SQL queries a simple process (a nearly free data warehouse). When used to its full potential, Sawmill is useful for understanding user behavior such as clickstream, click density, segmentation, long-tail analysis, and predicting what users will do next (when used with a data mining tool, e.g. MineSet). Sawmill is also useful for understanding patterns of user behavior, for example: why are a number of users going through a particular path before reaching the "Thank you" page?
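
The path question above can be explored once the log has been grouped into per-visitor sessions. The sketch below assumes each session is simply an ordered list of page URLs; that layout, the `/thank-you` URL, and the sample sessions are hypothetical and do not reflect Sawmill's actual schema or output.

```python
from collections import Counter


def paths_to_goal(sessions, goal="/thank-you"):
    """Count the distinct page sequences that end at the goal page."""
    counts = Counter()
    for pages in sessions:
        if goal in pages:
            # Keep only the pages up to and including the first goal hit.
            path = tuple(pages[: pages.index(goal) + 1])
            counts[path] += 1
    return counts


# Hypothetical sessions, for illustration only.
sessions = [
    ["/", "/plans", "/quote", "/thank-you"],
    ["/", "/plans", "/quote", "/thank-you", "/"],
    ["/", "/faq", "/quote", "/thank-you"],
    ["/", "/plans"],
]

path_counts = paths_to_goal(sessions)
top_paths = path_counts.most_common()
```

Sorting the paths by frequency shows which routes most often lead to the goal page; sessions that never reach it are ignored here and would need a separate drop-off analysis.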

Using Sawmill with Excel, or Sawmill with MineSet, data visualization can reveal the potential traffic that can be generated with long-tail analyses, e.g. of pageviews or keyword searches. For example, using Excel, a logarithmic plot of pageviews for each page can reveal where the humps, dips, and drooping tails of pageviews are. Note that humps and dips can be readily seen on a log scale but not on a linear scale. Using these humps and dips, website traffic (or traffic to specific landing pages) can be increased by adding relevant content at those humps and dips. For more information on the lognormal distribution for visualizing the long tail, see this paper: "A Brief History of Generative Models for Power Law and Lognorma