Efficient fractional-factorial design for mixed logit model
Posted: Wed Mar 10, 2021 4:18 am
Dear Choice-Metrics team,
First, I would like to thank you for developing Ngene and for releasing such a detailed user manual! It not only explains the software well, but also provides more insight into experimental design theory for SC experiments than most of the literature. I read it over the weekend and have a much better understanding of SC experiments now. Before continuing with some information about my research context and my questions, I would like to warn you that I’m very new to discrete choice analysis and therefore want to apologise in advance for any silly questions or mistakes.
My objective is to investigate whether entrepreneurs who raise equity have high preference heterogeneity concerning their investor choice. For this reason, I want to analyze the importance of investor characteristics as well as interaction (complementarity and substitutability) effects between them.
Model specification: My idea is to present entrepreneurs with 10 choice sets that each comprise 3 alternatives (hypothetical investors 1, 2 and 3). Each alternative is described by 5 generic attributes (e.g., reputation or R&D-related support), each of which can take on 3 levels (low, medium, and high). I want to dummy code each generic attribute into two dummy variables indicating the deviation from the reference level. Participants are then asked to identify the best and the worst of the three hypothetical investors (each respondent thus makes 20 choices). I found this approach in another paper, and it has the advantage that I will obtain a complete ranking of alternatives for each choice set. To analyze such rank-ordered data, the authors of that paper applied a rank-ordered mixed logit model (also called a random coefficients model). As this worked well for them, I want to follow them and apply the same model type. Considering all these specifications, I want to generate an efficient fractional-factorial design. In this context, I have the following questions:
1) Coding scheme: I want to show 3 qualitative levels (e.g., low, medium, and high reputation). Is it correct to use design coding (e.g., A[0,1,2]) and to replace the numbers by the words «low, medium, and high» later on in the questionnaire?
2) Dummy coded attributes: In the mentioned paper, they used the level with the (presumably) lowest benefit as the reference for each attribute to «ensure convenient interpretation of coefficient estimates». Let’s say I have the attribute level «having a low reputation» and I want to set this level (since it has the lowest benefit) as the reference. Considering page 111 in the manual, would I then have to relate the 2 in the term A[0,1,2] to the level «low», since the dummy variables (e.g., b1.dummy[1.2 | 2.2]) will automatically relate to the first two numbers/levels in the brackets?
3) Attribute level balance and algorithms: In the paper I found, the attribute levels with the (presumably) lowest benefit appeared 12x in the design each, whereas the other levels appeared 9x each. If I understand it correctly, this means that the attribute level balance property was satisfied, right? I would prefer a similar setup to ensure that the parameters can be estimated well over the whole range of levels. Is this a problem considering that row-based algorithms like the Modified Fedorov algorithm are suggested for unlabelled designs like mine?
4) Rp vs. ec vs. rpec: In the paper I found (sorry for referring to it again), they also generated their efficient design with Ngene. However, I do not understand why they apply a different utility function from the ones described in the manual. They modeled the utility of alternative j in choice set t for respondent n as a quadratic additive function of the alternative's characteristics, described by the vector x_njt. This function contained a linear term β_n x_njt and a quadratic term. In the latter, which captured the interaction effects, the symmetric matrix carried a coefficient for each interaction that they included. I think up to here it is identical. Then they state "Note that both the coefficient vector and the coefficient matrix are respondent-specific. The e_njt are residual error terms that are assumed to be independently and identically distributed and to follow an extreme value distribution." If I understand it correctly, this is a combined design (random parameters and error components), but the error components are not normally distributed (as suggested in the manual). Do you have an idea why this could make sense and whether it is possible to model it with Ngene?
5) Values of the parameter estimates: As I want to employ a rank-ordered mixed logit model, I had a look into the section "estimating random parameters models". It says that the parameters need to be defined as distributions. However, I do not know much about the parameters except for their sign so far.
5.1) Considering this, would you suggest conducting a small pilot study in order to get a better idea of the necessary inputs for the normal or uniform distributions?
5.2) If so, how would you generate the design for the pilot study?
5.3) Is it also necessary to define the interaction parameters as distributions (I want to analyze 5-10 two-way interaction effects)?
6) Internet survey: Can you recommend any internet survey software that is particularly compatible with Ngene? Maybe Sawtooth?
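To make question 3 more concrete, here is a minimal Python sketch of how the level counts in one design column could be checked. It is only my own illustration, not Ngene output: the two 30-row columns are hypothetical (they assume 10 choice sets x 3 alternatives, with the first one simply reproducing the 12/9/9 split from the paper), and `level_counts` and `is_balanced` are names I made up.

```python
from collections import Counter


def level_counts(column):
    """Count how often each attribute level appears in one design column."""
    return Counter(column)


def is_balanced(column):
    """Strict attribute level balance: every level appears equally often."""
    counts = level_counts(column)
    return len(set(counts.values())) == 1


# Hypothetical columns for one attribute (assumption: 10 choice sets x 3
# alternatives = 30 rows). Level 0 = low (reference), 1 = medium, 2 = high.
paper_like = [0] * 12 + [1] * 9 + [2] * 9   # 12/9/9 split as in the paper
strict = [0] * 10 + [1] * 10 + [2] * 10     # equal counts per level

print(level_counts(paper_like))
print(is_balanced(paper_like))
print(is_balanced(strict))
```

If attribute level balance is read strictly as equal counts per level, the 12/9/9 split would not pass this check while 10/10/10 would, so I may be misreading either the property or the paper's setup.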
Sorry for all these questions. I thought collecting them would be better than asking many follow-up questions.
I would be very pleased to receive your answers / opinions on these issues.
Thank you in advance and kind regards
Michael